Chat Models
Generate conversational responses with our selection of open-source chat models. All models support streaming and are OpenAI-compatible.
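Because every model is OpenAI-compatible, any OpenAI client library works once it is pointed at the platform's endpoint. A minimal setup sketch, assuming the official `openai` Python SDK; the base URL and key below are placeholders, and the later examples assume this `client`:

```python
from openai import OpenAI

# Placeholder endpoint and key: substitute your actual base URL and API key.
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)
```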
Available Models
Llama 3.3 70B (FREE - Recommended)
Meta’s latest 70-billion-parameter model with state-of-the-art reasoning, available free via Groq.
| Specification | Value |
| --- | --- |
| Provider | Meta (via Groq) |
| Parameters | 70B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | FREE |
| Latency | ~200ms first token |
Best for:
- Complex reasoning tasks
- Creative writing
- Advanced code generation
- All general-purpose chat applications
```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
Free Model: This is our recommended model for all chat use cases. It offers superior quality at zero cost via Groq’s free inference tier.
Llama 3.1 8B
Meta’s efficient 8-billion-parameter model with excellent performance for most tasks.
| Specification | Value |
| --- | --- |
| Provider | Meta |
| Parameters | 8B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | $0.10 / million tokens |
| Latency | ~100ms first token |
Best for:
- General chat applications
- Code generation
- Summarization
- Q&A systems
```python
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
Llama 3.1 70B
Meta’s flagship 70-billion-parameter model for complex reasoning and high-quality outputs.
| Specification | Value |
| --- | --- |
| Provider | Meta |
| Parameters | 70B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | $0.90 / million tokens |
| Latency | ~500ms first token |
Best for:
- Complex reasoning tasks
- Creative writing
- Advanced code generation
- Detailed analysis
```python
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Write a detailed business plan"}]
)
```
Mistral 7B
Mistral AI’s efficient model known for strong reasoning and coding capabilities.
| Specification | Value |
| --- | --- |
| Provider | Mistral AI |
| Parameters | 7B |
| Context Window | 32,000 tokens |
| Max Output | 4,096 tokens |
| Price | $0.10 / million tokens |
| Latency | ~80ms first token |
Best for:
- Code completion
- Reasoning tasks
- Multilingual applications
- Cost-effective deployment
```python
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
```
Qwen2 7B
Alibaba’s Qwen 2 model with strong multilingual and mathematical capabilities.
| Specification | Value |
| --- | --- |
| Provider | Alibaba |
| Parameters | 7B |
| Context Window | 32,000 tokens |
| Max Output | 4,096 tokens |
| Price | $0.10 / million tokens |
| Latency | ~90ms first token |
Best for:
- Multilingual applications
- Mathematical reasoning
- Chinese language tasks
- Code generation
```python
response = client.chat.completions.create(
    model="qwen2-7b",
    messages=[{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}]
)
```
Gemma 2 9B
Google’s efficient model built on Gemini technology.
| Specification | Value |
| --- | --- |
| Provider | Google |
| Parameters | 9B |
| Context Window | 8,000 tokens |
| Max Output | 2,048 tokens |
| Price | $0.15 / million tokens |
| Latency | ~100ms first token |
Best for:
- General conversation
- Text classification
- Summarization
- On-device deployment
```python
response = client.chat.completions.create(
    model="gemma-2-9b",
    messages=[{"role": "user", "content": "Summarize this article"}]
)
```
Phi-3 Mini
Microsoft’s compact model optimized for speed and efficiency.
| Specification | Value |
| --- | --- |
| Provider | Microsoft |
| Parameters | 3.8B |
| Context Window | 4,000 tokens |
| Max Output | 2,048 tokens |
| Price | $0.08 / million tokens |
| Latency | ~50ms first token |
Best for:
- High-volume applications
- Simple tasks
- Real-time responses
- Cost-sensitive deployments
```python
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
```
Model Comparison
| Model | Quality | Speed | Context | Price |
| --- | --- | --- | --- | --- |
| llama-3.3-70b | ★★★★★ | ★★★★☆ | 128K | FREE |
| llama-3.1-70b | ★★★★★ | ★★☆☆☆ | 128K | $0.90/M |
| llama-3.1-8b | ★★★★☆ | ★★★★☆ | 128K | $0.10/M |
| mistral-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M |
| qwen2-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M |
| gemma-2-9b | ★★★☆☆ | ★★★★☆ | 8K | $0.15/M |
| phi-3-mini | ★★★☆☆ | ★★★★★ | 4K | $0.08/M |
Common Parameters
All chat models support these parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| messages | array | Conversation history |
| temperature | float | Sampling randomness (0–2) |
| max_tokens | int | Maximum output length |
| top_p | float | Nucleus sampling threshold |
| stream | bool | Enable streaming |
| stop | array | Stop sequences |
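A sketch combining several of these parameters; the values shown are illustrative, not recommended defaults:

```python
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "List three uses of Python."}],
    temperature=0.7,   # moderate randomness
    max_tokens=256,    # cap the output length
    top_p=0.9,         # nucleus sampling threshold
    stop=["\n\n"],     # halt at the first blank line
)
print(response.choices[0].message.content)
```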
Usage Tips
Use System Messages
Set behavior and context with a system message for consistent results (see the sketch below).
Enable Streaming
Use stream=True for better UX with longer responses.
Manage Context
Trim old messages to stay within context limits (a simple trimming helper is sketched below).
Adjust Temperature
Use lower values for factual tasks and higher values for creative writing.
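Putting the first two tips together: a minimal sketch of a streamed request with a system message, using the standard OpenAI-style streaming interface, where each chunk carries an incremental delta:

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain quantum computing in two paragraphs."},
    ],
    temperature=0.3,  # lower temperature suits a factual task
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

For context management, one simple approach is to keep the system message plus only the most recent turns. The helper below is a hypothetical sketch that counts messages; production code would typically count tokens against the model's context window instead:

```python
def trim_history(messages, max_messages=20):
    # Keep system messages, then only the most recent max_messages other turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```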
Choosing a Model
Recommendation: Start with llama-3.3-70b for most use cases. It’s free via Groq, has the best quality, and offers a 128K-token context window.