Chat Models

Generate conversational responses with our selection of open-source chat models. All models support streaming and are OpenAI-compatible.
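
Because the API is OpenAI-compatible, the examples on this page use the official OpenAI Python SDK. A minimal setup sketch, assuming a placeholder endpoint and key environment variable (substitute your actual values); every snippet below reuses this client:

import os
from openai import OpenAI

# Placeholder base_url and key variable; replace with your actual
# endpoint and credentials.
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key=os.environ["API_KEY"],
)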

Available Models

Llama 3.3 70B

Meta’s latest 70 billion parameter model with state-of-the-art reasoning. FREE via Groq.
Specification | Value
Provider | Meta (via Groq)
Parameters | 70B
Context Window | 128,000 tokens
Max Output | 8,192 tokens
Price | FREE
Latency | ~200ms first token
Best for:
  • Complex reasoning tasks
  • Creative writing
  • Advanced code generation
  • All general-purpose chat applications
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Free Model: This is our recommended model for all chat use cases. It offers superior quality at zero cost via Groq’s free inference tier.

Llama 3.1 8B

Meta’s efficient 8 billion parameter model with excellent performance for most tasks.
Specification | Value
Provider | Meta
Parameters | 8B
Context Window | 128,000 tokens
Max Output | 8,192 tokens
Price | $0.10 / million tokens
Latency | ~100ms first token
Best for:
  • General chat applications
  • Code generation
  • Summarization
  • Q&A systems
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

Llama 3.1 70B

Meta’s flagship 70 billion parameter model for complex reasoning and high-quality outputs.
Specification | Value
Provider | Meta
Parameters | 70B
Context Window | 128,000 tokens
Max Output | 8,192 tokens
Price | $0.90 / million tokens
Latency | ~500ms first token
Best for:
  • Complex reasoning tasks
  • Creative writing
  • Advanced code generation
  • Detailed analysis
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Write a detailed business plan"}]
)

Mistral 7B

Mistral AI’s efficient model known for strong reasoning and coding capabilities.
Specification | Value
Provider | Mistral AI
Parameters | 7B
Context Window | 32,000 tokens
Max Output | 4,096 tokens
Price | $0.10 / million tokens
Latency | ~80ms first token
Best for:
  • Code completion
  • Reasoning tasks
  • Multilingual applications
  • Cost-effective deployment
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)

Qwen2 7B

Alibaba’s Qwen2 model with strong multilingual and mathematical capabilities.
Specification | Value
Provider | Alibaba
Parameters | 7B
Context Window | 32,000 tokens
Max Output | 4,096 tokens
Price | $0.10 / million tokens
Latency | ~90ms first token
Best for:
  • Multilingual applications
  • Mathematical reasoning
  • Chinese language tasks
  • Code generation
response = client.chat.completions.create(
    model="qwen2-7b",
    messages=[{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}]
)

Gemma 2 9B

Google’s efficient model, built from the same research and technology as Gemini.
Specification | Value
Provider | Google
Parameters | 9B
Context Window | 8,000 tokens
Max Output | 2,048 tokens
Price | $0.15 / million tokens
Latency | ~100ms first token
Best for:
  • General conversation
  • Text classification
  • Summarization
  • On-device deployment
response = client.chat.completions.create(
    model="gemma-2-9b",
    messages=[{"role": "user", "content": "Summarize this article"}]
)

Phi-3 Mini

Microsoft’s compact model optimized for speed and efficiency.
Specification | Value
Provider | Microsoft
Parameters | 3.8B
Context Window | 4,000 tokens
Max Output | 2,048 tokens
Price | $0.08 / million tokens
Latency | ~50ms first token
Best for:
  • High-volume applications
  • Simple tasks
  • Real-time responses
  • Cost-sensitive deployments
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)

Model Comparison

Model | Quality | Speed | Context | Price
llama-3.3-70b | ★★★★★ | ★★★★☆ | 128K | FREE
llama-3.1-70b | ★★★★★ | ★★☆☆☆ | 128K | $0.90/M
llama-3.1-8b | ★★★★☆ | ★★★★☆ | 128K | $0.10/M
mistral-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M
qwen2-7b | ★★★★☆ | ★★★★☆ | 32K | $0.10/M
gemma-2-9b | ★★★☆☆ | ★★★★☆ | 8K | $0.15/M
phi-3-mini | ★★★☆☆ | ★★★★★ | 4K | $0.08/M

Common Parameters

All chat models support these parameters:
Parameter | Type | Description
messages | array | Conversation history
temperature | float | Randomness (0-2)
max_tokens | int | Maximum output length
top_p | float | Nucleus sampling
stream | bool | Enable streaming
stop | array | Stop sequences

Usage Tips

Use System Messages

Set behavior and context with system messages for consistent results
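
A minimal sketch of a system message pinning the assistant’s behavior (the persona text is just an example):

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        # The system message sets tone and constraints for every turn.
        {"role": "system", "content": "You are a concise technical assistant. Answer in plain language."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
)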

Enable Streaming

Use stream=True for better UX with longer responses
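
With stream=True, the SDK yields chunks as they arrive, so you can print each delta incrementally:

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)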

Manage Context

Trim old messages to stay within context limits
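
One possible trimming strategy, sketched with a crude ~4-characters-per-token estimate; trim_history is a hypothetical helper, not part of the API, and a real implementation would count tokens with the model’s tokenizer:

def trim_history(messages, max_chars=400_000):
    """Keep the system message plus the most recent turns under a rough size cap.

    400,000 characters is roughly 100K tokens at ~4 chars/token,
    leaving headroom inside a 128K context window.
    Assumes messages[0] is the system message.
    """
    system, rest = messages[:1], messages[1:]
    while rest and sum(len(m["content"]) for m in system + rest) > max_chars:
        rest.pop(0)  # drop the oldest non-system message first
    return system + rest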

Adjust Temperature

Lower for factual tasks, higher for creative writing
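
For instance, the same model can serve both modes; the exact values below are illustrative:

factual = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "List the planets in order from the sun"}],
    temperature=0.2,  # near-deterministic, suits factual answers
)
creative = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a haiku about the sea"}],
    temperature=1.2,  # more varied phrasing for creative output
)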

Choosing a Model

Recommendation: Start with llama-3.3-70b for most use cases. It’s free via Groq, has the best quality, and offers 128K context.