Chat Completions
Generate AI responses for conversational applications. This endpoint is fully compatible with the OpenAI Chat Completions API.
Endpoint
POST https://api.assisters.dev/v1/chat/completions
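If you are not using an SDK, the endpoint also accepts a plain JSON POST. A minimal sketch with Python's requests library, assuming the standard Bearer authorization scheme used by OpenAI-compatible APIs:

import requests

resp = requests.post(
    "https://api.assisters.dev/v1/chat/completions",
    headers={"Authorization": "Bearer ask_your_api_key"},  # assumed Bearer auth
    json={
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "Hello"}]
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])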
Request Body
model (string, required)
The model to use for completion. See available models. Examples: llama-3.1-8b, llama-3.1-70b, mistral-7b

messages (array, required)
An array of messages comprising the conversation so far. Each message object has:
role (string): system, user, or assistant
content (string): The content of the message

stream (boolean, optional)
If true, returns a stream of Server-Sent Events (SSE) for real-time responses.

max_tokens (integer, optional)
Maximum number of tokens to generate. Defaults to the model's maximum.

temperature (number, optional)
Sampling temperature between 0 and 2. Higher values make output more random.

top_p (number, optional)
Nucleus sampling parameter. Use either this or temperature, not both.

stop (string or array, optional)
Up to 4 sequences where the API will stop generating tokens.

presence_penalty (number, optional)
Penalty for new tokens based on whether they appear in the text so far. Range: -2.0 to 2.0.

frequency_penalty (number, optional)
Penalty for new tokens based on their frequency in the text. Range: -2.0 to 2.0.

user (string, optional)
A unique identifier for the end-user, useful for monitoring and abuse detection.
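As a sketch of how these parameters combine in one call (the values below are illustrative, not recommended defaults):

from openai import OpenAI

client = OpenAI(
    api_key="ask_your_api_key",
    base_url="https://api.assisters.dev/v1"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Name three sorting algorithms."}],
    max_tokens=256,          # cap completion length
    temperature=0.7,         # moderate randomness; omit top_p when this is set
    stop=["\n\n"],           # stop at the first blank line
    frequency_penalty=0.5,   # discourage token repetition
    user="user-1234"         # end-user identifier for abuse monitoring
)
print(response.choices[0].message.content)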
Request Examples
Basic Request
from openai import OpenAI

client = OpenAI(
    api_key="ask_your_api_key",
    base_url="https://api.assisters.dev/v1"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of Japan?"}
    ]
)

print(response.choices[0].message.content)
Streaming Request
from openai import OpenAI

client = OpenAI(
    api_key="ask_your_api_key",
    base_url="https://api.assisters.dev/v1"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Write a short poem about coding"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Multi-turn Conversation
messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
    {"role": "user", "content": "And what is that multiplied by 3?"}
]

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=messages
)
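To continue the dialogue, append the assistant's reply and the next user turn to the same list before the next call; a minimal sketch continuing the example above:

# Carry the assistant's answer forward, then add the follow-up question.
messages.append({
    "role": "assistant",
    "content": response.choices[0].message.content
})
messages.append({"role": "user", "content": "Now subtract 5 from that."})

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=messages
)
print(response.choices[0].message.content)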
Response
Non-Streaming Response
{
  "id": "chatcmpl-abc123xyz",
  "object": "chat.completion",
  "created": 1706745600,
  "model": "llama-3.1-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of Japan is Tokyo."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
Streaming Response
Each chunk in the stream:
{
  "id": "chatcmpl-abc123xyz",
  "object": "chat.completion.chunk",
  "created": 1706745600,
  "model": "llama-3.1-8b",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "The"
      },
      "finish_reason": null
    }
  ]
}
Final chunk:
{
  "id": "chatcmpl-abc123xyz",
  "object": "chat.completion.chunk",
  "created": 1706745600,
  "model": "llama-3.1-8b",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}
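Putting the chunk format to work: content arrives in delta fields until the final chunk, whose empty delta carries the finish_reason. A sketch that accumulates the full reply:

full_text = ""
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content:
        full_text += choice.delta.content
    if choice.finish_reason is not None:
        break  # final chunk: empty delta, finish_reason set
print(full_text)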
Response Fields
id (string)
Unique identifier for the completion.

object (string)
Always chat.completion for non-streaming responses, or chat.completion.chunk for streaming.

created (integer)
Unix timestamp of when the completion was created.

model (string)
The model used for completion.

choices (array)
Array of completion choices. Each choice contains:
index: The index of this choice
message: The generated message (non-streaming)
delta: The incremental content (streaming)
finish_reason: Why generation stopped (stop, length, content_filter)

usage (object)
Token usage statistics (not included in streaming):
prompt_tokens: Tokens in the input
completion_tokens: Tokens in the output
total_tokens: Total tokens used
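These fields map directly onto attributes of the SDK response object; a quick sketch of reading them from a non-streaming response:

print(response.id)             # "chatcmpl-abc123xyz"
print(response.model)          # "llama-3.1-8b"

choice = response.choices[0]
print(choice.message.content)  # generated text
print(choice.finish_reason)    # "stop", "length", or "content_filter"

usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)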
Finish Reasons
Reason          Description
stop            Natural completion or a stop sequence was reached
length          The max_tokens limit was reached
content_filter  Content was filtered by moderation
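Since length means the output was cut off, it is worth checking finish_reason before trusting a completion; one way to handle it:

choice = response.choices[0]
if choice.finish_reason == "length":
    # Hit max_tokens: raise the limit or ask the model to continue.
    print("Warning: response was truncated.")
elif choice.finish_reason == "content_filter":
    print("Warning: response was filtered by moderation.")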
Error Responses
Invalid request (e.g., unknown model):

{
  "error": {
    "message": "Invalid model specified",
    "type": "invalid_request_error",
    "code": "invalid_model"
  }
}
Authentication error:

{
  "error": {
    "message": "Invalid API key provided",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}
Rate limit error:

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
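Because the endpoint mirrors the OpenAI error format, the official openai Python SDK (v1+) raises its usual typed exceptions for these responses; a sketch of catching them:

import openai

try:
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Hello"}]
    )
except openai.AuthenticationError as e:   # invalid_api_key
    print(f"Check your API key: {e}")
except openai.RateLimitError as e:        # rate_limit_exceeded
    print(f"Rate limited; retry later: {e}")
except openai.BadRequestError as e:       # invalid_request_error
    print(f"Invalid request: {e}")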
Best Practices
Use System Messages: Set context and behavior with system messages for consistent responses.
Stream Long Responses: Enable streaming for better UX with longer completions.
Manage Conversation Length: Trim old messages to stay within token limits.
Handle Errors Gracefully: Implement retry logic with exponential backoff (see the sketch below).
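For the last point, a minimal retry-with-backoff sketch (the retry count and base delay are arbitrary choices, not documented defaults):

import time
import openai

def create_with_retry(client, max_retries=5, base_delay=1.0, **kwargs):
    """Retry rate-limited requests with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

response = create_with_retry(
    client,
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello"}]
)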