Streaming Responses
Enable streaming for real-time, token-by-token responses. This improves perceived latency and user experience for chat applications.

How Streaming Works
Without streaming, the client waits for the entire response to be generated before anything can be displayed.

Enabling Streaming
Set stream=true in your request:
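A minimal sketch of such a request using only the standard library; the endpoint URL, model name, and payload fields are placeholders, not a specific provider's API:

```python
import json
import urllib.request

# Hypothetical request payload; adjust model and fields to your provider.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "stream": True,  # ask the server to deliver tokens as SSE events
}

req = urllib.request.Request(
    "https://api.example.com/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    },
)

# Reading the response line by line would then yield one SSE event per chunk:
# with urllib.request.urlopen(req) as resp:
#     for line in resp:
#         print(line.decode().rstrip())
```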
Stream Response Format
Streaming uses Server-Sent Events (SSE). Each event is a JSON object.

Chunk Structure
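For illustration, a hypothetical chunk format and a small parser; the field names (`choices`, `delta`, `content`) and the `[DONE]` sentinel are assumptions modeled on common chat-completion APIs:

```python
import json

# Hypothetical SSE lines as they might arrive over the wire.
raw_events = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]

def parse_sse_line(line: str):
    """Return the delta text from one SSE data line, or None at end of stream."""
    if not line.startswith("data: "):
        return None  # skip blank keep-alive lines and SSE comments
    payload = line[len("data: "):]
    if payload == "[DONE]":  # sentinel marking the end of the stream
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content", "")

text = "".join(t for t in map(parse_sse_line, raw_events) if t)
print(text)  # -> Hello!
```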
Web Application Example
React Hook
Usage
Python Async Streaming
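A sketch of consuming the stream asynchronously; the fake stream below stands in for a real HTTP client such as httpx or aiohttp, and the chunk shape is an assumption:

```python
import asyncio
import json
from typing import AsyncIterator

# Stand-in for an SSE stream from an async HTTP client; chunk format assumed.
async def fake_sse_stream() -> AsyncIterator[str]:
    for token in ("Once", " upon", " a", " time"):
        await asyncio.sleep(0)  # yield control, as real network reads would
        yield 'data: {"choices": [{"delta": {"content": %s}}]}' % json.dumps(token)
    yield "data: [DONE]"

async def collect(stream: AsyncIterator[str]) -> str:
    """Accumulate delta content from SSE lines until the [DONE] sentinel."""
    parts = []
    async for line in stream:
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload)["choices"][0]["delta"].get("content", ""))
    return "".join(parts)

result = asyncio.run(collect(fake_sse_stream()))
print(result)  # -> Once upon a time
```

In a real client you would replace `fake_sse_stream` with the line iterator of a streaming HTTP response.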
FastAPI Streaming
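On the server side, the core is an async generator that formats each token as an SSE frame. A sketch with a hardcoded token source; the FastAPI wiring (shown in comments, assumed) would wrap the generator in `StreamingResponse`:

```python
import asyncio
from typing import AsyncIterator, Iterable

async def sse_events(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield each token as an SSE frame; frames are terminated by a blank line."""
    for token in tokens:
        await asyncio.sleep(0)  # in practice: await the next model token
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream sentinel

# Assumed FastAPI wiring:
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
# app = FastAPI()
# @app.get("/chat")
# async def chat():
#     return StreamingResponse(sse_events(generate_tokens()),
#                              media_type="text/event-stream")

async def demo() -> list[str]:
    return [frame async for frame in sse_events(["Hi", "there"])]

frames = asyncio.run(demo())
print(frames)
```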
Handling Stream Errors
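One approach is to wrap stream consumption in retry logic with exponential backoff. A sketch: `read_stream` is a hypothetical callable returning a token iterator, and `ConnectionError` stands in for whatever your HTTP client raises on a dropped connection:

```python
import time

def consume_with_retry(read_stream, max_retries: int = 3,
                       base_delay: float = 0.01) -> str:
    """Consume a token stream, reconnecting with exponential backoff on errors."""
    parts = []
    for attempt in range(max_retries):
        try:
            for token in read_stream():
                parts.append(token)
            return "".join(parts)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return "".join(parts)

# Simulated stream that drops mid-response once, then succeeds.
calls = {"n": 0}
def flaky_stream():
    calls["n"] += 1
    if calls["n"] == 1:
        yield "partial "
        raise ConnectionError("connection dropped mid-stream")
    yield from ("full ", "response")

result = consume_with_retry(flaky_stream)
print(result)  # tokens received before the drop are kept
```

Note that this naive sketch keeps tokens received before the drop and then replays the whole stream; a real client may need to de-duplicate or request a continuation instead.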
Token Counting with Streams
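Because usage stats typically arrive only with the final chunk, a client can accumulate delta text while watching each chunk for a closing usage object. A sketch with a hypothetical chunk shape and illustrative token counts:

```python
# Hypothetical chunk format: only the last event carries a usage object.
chunks = [
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
    {"choices": [{"delta": {}}],
     "usage": {"prompt_tokens": 8, "completion_tokens": 2, "total_tokens": 10}},
]

text_parts, usage = [], None
for chunk in chunks:
    text_parts.append(chunk["choices"][0]["delta"].get("content", ""))
    usage = chunk.get("usage") or usage  # only the final chunk carries usage

full_text = "".join(text_parts)
print(full_text)               # -> Hello world
print(usage["total_tokens"])   # -> 10
```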
Streaming responses don’t include usage stats until the end of the stream.

Best Practices
- Always Use for Chat: Streaming dramatically improves UX for conversational interfaces.
- Handle Disconnects: Implement reconnection logic for long responses.
- Buffer Display: Display tokens as they arrive; don’t wait for full words.
- Show Typing Indicator: Show users that a response is being generated.
When Not to Stream
Streaming isn’t always the best choice:

| Use Case | Recommendation |
|---|---|
| Chat interfaces | ✅ Stream |
| Batch processing | ❌ Don’t stream |
| Short responses | Either works |
| JSON extraction | ❌ Don’t stream |
| Background tasks | ❌ Don’t stream |