Token Counting

Tokens are the basic unit of text processing and billing in the Assisters API. Understanding tokens helps you optimize costs and manage rate limits.

What Are Tokens?

Tokens are pieces of words used by AI models. A rough rule of thumb:
  • 1 token ≈ 4 characters in English
  • 1 token ≈ 0.75 words in English
  • 100 tokens ≈ 75 words
The exact tokenization depends on the model. Different models may tokenize the same text differently.

Examples

Text                        | Tokens
"Hello"                     | 1
"Hello, world!"             | 4
"The quick brown fox"       | 4
"Artificial Intelligence"   | 2-3
"こんにちは" (Japanese)      | 3-5

Counting Tokens

In API Responses

Every response includes token usage:
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
Field             | Description
prompt_tokens     | Tokens in your input (messages)
completion_tokens | Tokens in the model's response
total_tokens      | Total tokens billed
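
The same fields are available on the response object returned by the SDK. A minimal sketch, assuming the `client` used throughout these examples is an OpenAI-compatible Python client initialized elsewhere in the docs:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}]
)

usage = response.usage
print(f"Prompt: {usage.prompt_tokens}, "
      f"Completion: {usage.completion_tokens}, "
      f"Total: {usage.total_tokens}")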

Before Sending (Estimation)

You can use tiktoken to estimate tokens locally before making requests. The count is an approximation, since hosted models use their own tokenizers:
import tiktoken

def count_tokens(text):
    # cl100k_base is not the Llama tokenizer, but it gives a usable estimate
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Example
text = "What is the meaning of life?"
tokens = count_tokens(text)
print(f"Estimated tokens: {tokens}")  # ~7 tokens

Counting Message Tokens

Messages have overhead beyond just the content:
def count_message_tokens(messages):
    encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for message in messages:
        # Message overhead (role, formatting)
        total += 4

        # Content tokens
        total += len(encoding.encode(message.get("content", "")))

        # Name (if present)
        if "name" in message:
            total += len(encoding.encode(message["name"]))
            total += 1  # Name field overhead

    # Conversation overhead
    total += 2

    return total

# Example
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]
print(f"Total: {count_message_tokens(messages)} tokens")

Billing Calculation

You’re billed for total tokens (input + output):
Cost = (prompt_tokens + completion_tokens) × price_per_million / 1,000,000

Example Calculation

# Llama 3.1 8B pricing: $0.10 per million tokens

prompt_tokens = 500
completion_tokens = 1000
total_tokens = 1500

price_per_million = 0.10
cost = total_tokens * price_per_million / 1_000_000

print(f"Cost: ${cost:.6f}")  # $0.000150

Monthly Usage Estimation

def estimate_monthly_cost(requests_per_day, avg_tokens_per_request, price_per_million):
    daily_tokens = requests_per_day * avg_tokens_per_request
    monthly_tokens = daily_tokens * 30
    monthly_cost = monthly_tokens * price_per_million / 1_000_000

    return {
        "daily_tokens": daily_tokens,
        "monthly_tokens": monthly_tokens,
        "monthly_cost": monthly_cost
    }

# Example: 1000 requests/day, 500 tokens each, $0.10/M
estimate = estimate_monthly_cost(1000, 500, 0.10)
print(f"Monthly cost: ${estimate['monthly_cost']:.2f}")  # $1.50

Token Limits

Context Window

Each model has a maximum context window:
Model         | Context Window
llama-3.1-8b  | 128,000 tokens
llama-3.1-70b | 128,000 tokens
mistral-7b    | 32,000 tokens
phi-3-mini    | 4,000 tokens
If your input exceeds the context window, the request will fail with a context_length_exceeded error.
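
Because local counts are only estimates, it helps to leave headroom when checking input size against the window. A minimal pre-flight check, reusing count_message_tokens from above (the CONTEXT_WINDOWS mapping and fits_in_context helper are illustrative, not part of the SDK):
# Context sizes from the table above; adjust if your account differs
CONTEXT_WINDOWS = {
    "llama-3.1-8b": 128_000,
    "llama-3.1-70b": 128_000,
    "mistral-7b": 32_000,
    "phi-3-mini": 4_000,
}

def fits_in_context(messages, model, max_output_tokens=500):
    # Reserve room for the response as well as the prompt
    estimated_input = count_message_tokens(messages)
    return estimated_input + max_output_tokens <= CONTEXT_WINDOWS[model]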

Output Limits

You can limit output tokens with max_tokens:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write an essay"}],
    max_tokens=500  # Limit response length
)
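
When the model hits max_tokens before it finishes, OpenAI-compatible SDKs typically report this through finish_reason. Assuming the Assisters API follows that convention, you can detect truncated output like this:
choice = response.choices[0]
if choice.finish_reason == "length":
    # The response was cut off at the max_tokens cap
    print("Warning: response was truncated")
print(choice.message.content)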

Optimizing Token Usage

1. Trim Conversation History

Keep only recent messages:
def trim_messages(messages, max_tokens=4000):
    encoding = tiktoken.get_encoding("cl100k_base")

    # Always keep the system message
    system = []
    if messages and messages[0]["role"] == "system":
        system = [messages[0]]
        messages = messages[1:]

    current_tokens = count_message_tokens(system)

    # Walk backwards from the most recent message; stop when the budget is hit
    kept = []
    for msg in reversed(messages):
        msg_tokens = len(encoding.encode(msg["content"])) + 4
        if current_tokens + msg_tokens > max_tokens:
            break
        kept.append(msg)
        current_tokens += msg_tokens

    # Restore chronological order
    return system + list(reversed(kept))
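
For example, trimming a running conversation before each request keeps the prompt within budget (conversation_history here is a hypothetical list of prior messages):
trimmed = trim_messages(conversation_history, max_tokens=4000)
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=trimmed
)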

2. Summarize Long Contexts

def summarize_if_long(text, max_tokens=2000):
    tokens = count_tokens(text)

    if tokens <= max_tokens:
        return text

    # Summarize with AI
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[
            {"role": "system", "content": "Summarize concisely:"},
            {"role": "user", "content": text}
        ],
        max_tokens=max_tokens // 2
    )

    return response.choices[0].message.content

3. Use Efficient Prompts

# Verbose (more tokens)
messages = [{
    "role": "user",
    "content": "I would like you to please provide me with a comprehensive and detailed explanation of what machine learning is and how it works in general terms."
}]

# Concise (fewer tokens)
messages = [{
    "role": "user",
    "content": "Explain machine learning briefly."
}]
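
You can verify the savings with the count_tokens helper from earlier; exact counts depend on the tokenizer, so treat these as estimates:
verbose = ("I would like you to please provide me with a comprehensive and "
           "detailed explanation of what machine learning is and how it "
           "works in general terms.")
concise = "Explain machine learning briefly."

print(count_tokens(verbose))  # ~30 tokens (estimate)
print(count_tokens(concise))  # ~5 tokens (estimate)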

4. Cache Responses

Don't re-request the same information. functools.lru_cache can't be applied directly to message lists (dicts aren't hashable), so a small in-memory dictionary keyed by a hash of the messages works instead:
import hashlib
import json

# Simple in-memory cache keyed by a hash of the messages
_completion_cache = {}

def get_completion(messages):
    # Build a stable cache key from the messages
    key = hashlib.md5(json.dumps(messages, sort_keys=True).encode()).hexdigest()

    if key not in _completion_cache:
        _completion_cache[key] = client.chat.completions.create(
            model="llama-3.1-8b",
            messages=messages
        )

    return _completion_cache[key]
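
Repeated calls with identical messages are then served from memory instead of the API:
messages = [{"role": "user", "content": "What is a token?"}]
first = get_completion(messages)   # makes an API call
second = get_completion(messages)  # returned from the cache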

Tracking Usage

Per-Request Tracking

class UsageTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0
        self.request_count = 0

    def record(self, usage, price_per_million):
        tokens = usage.total_tokens
        cost = tokens * price_per_million / 1_000_000

        self.total_tokens += tokens
        self.total_cost += cost
        self.request_count += 1

    def report(self):
        return {
            "requests": self.request_count,
            "tokens": self.total_tokens,
            "cost": f"${self.total_cost:.4f}"
        }

# Usage
tracker = UsageTracker()

response = client.chat.completions.create(...)
tracker.record(response.usage, price_per_million=0.10)

print(tracker.report())

Dashboard Monitoring

Check your usage in real time at assisters.dev/dashboard/usage.

Token Pricing Reference

Model           | Price per Million Tokens
llama-3.1-8b    | $0.10
llama-3.1-70b   | $0.90
mistral-7b      | $0.10
e5-large-v2     | $0.01
llama-guard-3   | $0.20
bge-reranker-v2 | $0.05

Full Pricing Details

See complete pricing for all models and tiers.