Chat Models
Generate conversational responses with our selection of open-source chat models. All models support streaming and are OpenAI-compatible.
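Because every model is OpenAI-compatible, any OpenAI client library works once it is pointed at the platform's endpoint. A minimal setup sketch, assuming the official `openai` Python SDK; the base URL and key below are placeholders, and the later examples assume this `client`:

```python
from openai import OpenAI

# Placeholder endpoint and key: substitute your actual base URL and API key.
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)
```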
Available Models
Llama 3.3 70B (FREE - Recommended)
Meta’s latest 70-billion-parameter model with state-of-the-art reasoning, available free via Groq.
| Specification | Value |
| --- | --- |
| Provider | Meta (via Groq) |
| Parameters | 70B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | FREE |
| Latency | ~200ms first token |
Best for:
- Complex reasoning tasks
- Creative writing
- Advanced code generation
- All general-purpose chat applications
```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
Free Model: This is our recommended model for all chat use cases. It offers superior quality at zero cost via Groq’s free inference tier.
Llama 3.1 8B
Meta’s efficient 8-billion-parameter model with excellent performance for most tasks.
| Specification | Value |
| --- | --- |
| Provider | Meta |
| Parameters | 8B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | $0.10 / million tokens |
| Latency | ~100ms first token |
Best for:
- General chat applications
- Code generation
- Summarization
- Q&A systems
```python
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
Llama 3.1 70B
Meta’s flagship 70-billion-parameter model for complex reasoning and high-quality outputs.
| Specification | Value |
| --- | --- |
| Provider | Meta |
| Parameters | 70B |
| Context Window | 128,000 tokens |
| Max Output | 8,192 tokens |
| Price | $0.90 / million tokens |
| Latency | ~500ms first token |
Best for:
- Complex reasoning tasks
- Creative writing
- Advanced code generation
- Detailed analysis
```python
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Write a detailed business plan"}]
)
```
Mistral 7B
Mistral AI’s efficient model known for strong reasoning and coding capabilities.
| Specification | Value |
| --- | --- |
| Provider | Mistral AI |
| Parameters | 7B |
| Context Window | 32,000 tokens |
| Max Output | 4,096 tokens |
| Price | $0.10 / million tokens |
| Latency | ~80ms first token |
Best for:
- Code completion
- Reasoning tasks
- Multilingual applications
- Cost-effective deployment
```python
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
```
Qwen2 7B
Alibaba’s Qwen 2 model with strong multilingual and mathematical capabilities.
| Specification | Value |
| --- | --- |
| Provider | Alibaba |
| Parameters | 7B |
| Context Window | 32,000 tokens |
| Max Output | 4,096 tokens |
| Price | $0.10 / million tokens |
| Latency | ~90ms first token |
Best for:
- Multilingual applications
- Mathematical reasoning
- Chinese language tasks
- Code generation
```python
response = client.chat.completions.create(
    model="qwen2-7b",
    messages=[{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}]
)
```
Gemma 2 9B
Google’s efficient model built on Gemini technology.
| Specification | Value |
| --- | --- |
| Provider | Google |
| Parameters | 9B |
| Context Window | 8,000 tokens |
| Max Output | 2,048 tokens |
| Price | $0.15 / million tokens |
| Latency | ~100ms first token |
Best for:
- General conversation
- Text classification
- Summarization
- On-device deployment
```python
response = client.chat.completions.create(
    model="gemma-2-9b",
    messages=[{"role": "user", "content": "Summarize this article"}]
)
```
Phi-3 Mini
Microsoft’s compact model optimized for speed and efficiency.
| Specification | Value |
| --- | --- |
| Provider | Microsoft |
| Parameters | 3.8B |
| Context Window | 4,000 tokens |
| Max Output | 2,048 tokens |
| Price | $0.08 / million tokens |
| Latency | ~50ms first token |
Best for:
- High-volume applications
- Simple tasks
- Real-time responses
- Cost-sensitive deployments
```python
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
```
Model Comparison
| Model | Quality | Speed | Context | Price |
| --- | --- | --- | --- | --- |
| llama-3.3-70b | ★★★★★ | ★★★★☆ | 128K | FREE |
| llama-3.1-70b | ★★★★★ | ★★☆☆☆ | 128K | $0.90/M |
| llama-3.1-8b | ★★★★☆ | ★★★★☆ | 128K | $0.10/M |
| mistral-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M |
| qwen2-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M |
| gemma-2-9b | ★★★☆☆ | ★★★★☆ | 8K | $0.15/M |
| phi-3-mini | ★★★☆☆ | ★★★★★ | 4K | $0.08/M |
Common Parameters
All chat models support these parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| messages | array | Conversation history |
| temperature | float | Sampling randomness (0–2) |
| max_tokens | int | Maximum output length |
| top_p | float | Nucleus sampling threshold |
| stream | bool | Enable streaming |
| stop | array | Stop sequences |
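A sketch combining several of these parameters; the values shown are illustrative, not recommended defaults:

```python
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "List three uses of Python."}],
    temperature=0.7,   # moderate randomness
    max_tokens=256,    # cap the output length
    top_p=0.9,         # nucleus sampling threshold
    stop=["\n\n"],     # halt at the first blank line
)
print(response.choices[0].message.content)
```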
Usage Tips
Use System Messages
Set behavior and context with a system message for consistent results (see the sketch below).
Enable Streaming
Use stream=True for better UX with longer responses.
Manage Context
Trim old messages to stay within context limits (a simple trimming helper is sketched below).
Adjust Temperature
Use lower values for factual tasks and higher values for creative writing.
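Putting the first two tips together: a minimal sketch of a streamed request with a system message, using the standard OpenAI-style streaming interface, where each chunk carries an incremental delta:

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain quantum computing in two paragraphs."},
    ],
    temperature=0.3,  # lower temperature suits a factual task
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

For context management, one simple approach is to keep the system message plus only the most recent turns. The helper below is a hypothetical sketch that counts messages; production code would typically count tokens against the model's context window instead:

```python
def trim_history(messages, max_messages=20):
    # Keep system messages, then only the most recent max_messages other turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```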
Choosing a Model
Recommendation: Start with llama-3.3-70b for most use cases. It’s free via Groq, has the best quality, and offers a 128K-token context window.