Streaming Responses

Enable streaming for real-time, token-by-token responses. This reduces perceived latency and improves the user experience in chat applications.

How Streaming Works

Without streaming, you wait for the entire response:
[Request] ────────────────────────────> [Full Response]
          <---- 2-5 seconds ---->
With streaming, tokens arrive as they’re generated:
[Request] → [Token] → [Token] → [Token] → [Done]
          <50ms>   <50ms>   <50ms>
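
The practical payoff is time to first token. A quick way to see the difference is to time both modes with the same prompt; this is a minimal sketch using the same client configuration shown throughout this page:
import time

from openai import OpenAI

client = OpenAI(
    api_key="ask_your_key",
    base_url="https://api.assisters.dev/v1"
)

prompt = [{"role": "user", "content": "Write a poem about coding"}]

# Non-streaming: nothing arrives until the whole response is ready
start = time.perf_counter()
client.chat.completions.create(model="llama-3.1-8b", messages=prompt)
print(f"Full response after {time.perf_counter() - start:.2f}s")

# Streaming: measure how quickly the first token arrives
start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b", messages=prompt, stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.perf_counter() - start:.2f}s")
        break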

Enabling Streaming

Set stream=true in your request:
from openai import OpenAI

client = OpenAI(
    api_key="ask_your_key",
    base_url="https://api.assisters.dev/v1"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    stream=True  # Enable streaming
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Stream Response Format

Streaming uses Server-Sent Events (SSE). Each event's data field carries a JSON chunk, and the stream ends with a [DONE] sentinel:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{}},"finish_reason":"stop"]}

data: [DONE]
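
If you're not using an SDK, you can parse these events yourself. A minimal sketch using the requests library (an assumption; any HTTP client that can stream the response body works the same way):
import json
import requests

resp = requests.post(
    "https://api.assisters.dev/v1/chat/completions",
    headers={"Authorization": "Bearer ask_your_key"},
    json={
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True  # don't buffer the whole body
)

for line in resp.iter_lines():
    # Each SSE event arrives as a line prefixed with "data:"
    if not line or not line.startswith(b"data:"):
        continue
    data = line[len(b"data:"):].strip()
    if data == b"[DONE]":
        break
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)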

Chunk Structure

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1706745600,
  "model": "llama-3.1-8b",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",  // Only in first chunk
        "content": "Hello"    // Token content
      },
      "finish_reason": null   // "stop" in final chunk
    }
  ]
}
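
Because each chunk carries only a delta, clients typically accumulate the pieces into a complete message. A minimal sketch, assuming the client from the earlier examples:
def accumulate(stream):
    """Rebuild the full assistant message from streamed chunks."""
    role = None
    parts = []
    finish_reason = None

    for chunk in stream:
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.role:        # only present in the first chunk
            role = choice.delta.role
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason:     # only present in the final chunk
            finish_reason = choice.finish_reason

    return {"role": role, "content": "".join(parts), "finish_reason": finish_reason}

message = accumulate(client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
))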

Web Application Example

React Hook

import { useState, useCallback } from 'react';

function useStreamingChat() {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (message) => {
    setIsStreaming(true);
    setResponse('');

    const res = await fetch('https://api.assisters.dev/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer ask_your_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'llama-3.1-8b',
        messages: [{ role: 'user', content: message }],
        stream: true
      })
    });

    const reader = res.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // { stream: true } keeps multi-byte characters intact across reads
      const chunk = decoder.decode(value, { stream: true });
      const lines = chunk.split('\n').filter(line => line.startsWith('data:'));

      for (const line of lines) {
        const data = line.slice(5).trim();
        if (data === '[DONE]') continue;

        try {
          const json = JSON.parse(data);
          const content = json.choices[0]?.delta?.content || '';
          setResponse(prev => prev + content);
        } catch (e) {
          // Skip malformed chunks (a production version should buffer
          // partial lines that span reads instead of dropping them)
        }
      }
    }

    setIsStreaming(false);
  }, []);

  return { response, isStreaming, sendMessage };
}

Usage

function ChatComponent() {
  const { response, isStreaming, sendMessage } = useStreamingChat();

  return (
    <div>
      <button onClick={() => sendMessage('Hello!')}>
        Send
      </button>
      <div>{response}</div>
      {isStreaming && <span>...</span>}
    </div>
  );
}

Python Async Streaming

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="ask_your_key",
    base_url="https://api.assisters.dev/v1"
)

async def stream_chat(message):
    stream = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": message}],
        stream=True
    )

    full_response = ""
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            full_response += content
            print(content, end="", flush=True)

    return full_response

# Run
asyncio.run(stream_chat("Tell me a story"))

FastAPI Streaming

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key="ask_your_key", base_url="https://api.assisters.dev/v1")

@app.post("/chat")
async def chat(message: str):
    def generate():
        # A plain (sync) generator: FastAPI iterates it in a thread pool,
        # so the blocking SDK stream doesn't stall the event loop
        stream = client.chat.completions.create(
            model="llama-3.1-8b",
            messages=[{"role": "user", "content": message}],
            stream=True
        )

        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # JSON-encode the payload so newlines in a token
                # can't break SSE framing
                yield f"data: {json.dumps({'content': content})}\n\n"

        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

Handling Stream Errors

from openai import OpenAI, APIError

client = OpenAI(api_key="ask_your_key", base_url="https://api.assisters.dev/v1")

def safe_stream(messages):
    try:
        stream = client.chat.completions.create(
            model="llama-3.1-8b",
            messages=messages,
            stream=True
        )

        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield content

    except APIError as e:
        yield f"\n\n[Error: {e.message}]"

Token Counting with Streams

By default, streaming responses don’t include token usage; request it with stream_options to receive usage statistics in the final chunk:
def stream_with_usage(messages):
    stream = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=messages,
        stream=True,
        stream_options={"include_usage": True}  # Request usage in final chunk
    )

    full_response = ""
    usage = None

    for chunk in stream:
        if chunk.choices:
            content = chunk.choices[0].delta.content
            if content:
                full_response += content
                print(content, end="")

        # Usage appears in final chunk
        if hasattr(chunk, 'usage') and chunk.usage:
            usage = chunk.usage

    print(f"\n\nTokens used: {usage.total_tokens if usage else 'unknown'}")
    return full_response

Best Practices

Always Use for Chat

Streaming dramatically improves UX for conversational interfaces

Handle Disconnects

Implement reconnection logic for long responses (see the retry sketch after this list)

Buffer Display

Display tokens as they arrive; don’t wait for complete words or sentences

Show Typing Indicator

Let users see that a response is being generated
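
SSE streams can't be resumed mid-response, so "reconnection" means retrying the whole request. A minimal sketch of the retry logic mentioned above (the backoff values are illustrative, not a documented API feature):
import time

from openai import OpenAI, APIConnectionError

client = OpenAI(api_key="ask_your_key", base_url="https://api.assisters.dev/v1")

def stream_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        collected = []
        try:
            stream = client.chat.completions.create(
                model="llama-3.1-8b",
                messages=messages,
                stream=True
            )
            for chunk in stream:
                content = chunk.choices[0].delta.content
                if content:
                    collected.append(content)
                    print(content, end="", flush=True)
            return "".join(collected)
        except APIConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying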

When Not to Stream

Streaming isn’t always the best choice:
Use Case            Recommendation
Chat interfaces     ✅ Stream
Batch processing    ❌ Don’t stream
Short responses     Either works
JSON extraction     ❌ Don’t stream
Background tasks    ❌ Don’t stream

Debugging Streams

Log stream events for debugging:
import json

def debug_stream(messages):
    stream = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=messages,
        stream=True
    )

    for i, chunk in enumerate(stream):
        print(f"Chunk {i}: {json.dumps(chunk.model_dump(), indent=2)}")