Token Counting

Tokens are the basic unit of text processing and billing in the Assisters API. Understanding tokens helps you optimize costs and manage rate limits.

What Are Tokens?

Tokens are pieces of words used by AI models. A rough rule of thumb:
  • 1 token ≈ 4 characters in English
  • 1 token ≈ 0.75 words in English
  • 100 tokens ≈ 75 words
The exact tokenization depends on the model. Different models may tokenize the same text differently.

Examples

Text                        | Tokens
"Hello"                     | 1
"Hello, world!"             | 4
"The quick brown fox"       | 4
"Artificial Intelligence"   | 2-3
"こんにちは" (Japanese)      | 3-5

Counting Tokens

In API Responses

Every response includes token usage:
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
Field             | Description
prompt_tokens     | Tokens in your input (messages)
completion_tokens | Tokens in the model's response
total_tokens      | Total tokens billed
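
The same fields are available on the response object returned by the SDK. A minimal sketch, assuming the `client` used throughout these examples is an OpenAI-compatible Python client initialized elsewhere in the docs:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}]
)

usage = response.usage
print(f"Prompt: {usage.prompt_tokens}, "
      f"Completion: {usage.completion_tokens}, "
      f"Total: {usage.total_tokens}")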

Before Sending (Estimation)

You can use tiktoken to estimate tokens locally before making requests. The count is an approximation, since hosted models use their own tokenizers:
import tiktoken

def count_tokens(text):
    # cl100k_base is not the Llama tokenizer, but it gives a usable estimate
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Example
text = "What is the meaning of life?"
tokens = count_tokens(text)
print(f"Estimated tokens: {tokens}")  # ~7 tokens

Counting Message Tokens

Messages have overhead beyond just the content:
def count_message_tokens(messages):
    encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for message in messages:
        # Message overhead (role, formatting)
        total += 4

        # Content tokens
        total += len(encoding.encode(message.get("content", "")))

        # Name (if present)
        if "name" in message:
            total += len(encoding.encode(message["name"]))
            total += 1  # Name field overhead

    # Conversation overhead
    total += 2

    return total

# Example
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]
print(f"Total: {count_message_tokens(messages)} tokens")

Billing Calculation

You’re billed for total tokens (input + output):
Cost = (prompt_tokens + completion_tokens) × price_per_million / 1,000,000

Example Calculation

# Llama 3.1 8B pricing: $0.10 per million tokens

prompt_tokens = 500
completion_tokens = 1000
total_tokens = 1500

price_per_million = 0.10
cost = total_tokens * price_per_million / 1_000_000

print(f"Cost: ${cost:.6f}")  # $0.000150

Monthly Usage Estimation

def estimate_monthly_cost(requests_per_day, avg_tokens_per_request, price_per_million):
    daily_tokens = requests_per_day * avg_tokens_per_request
    monthly_tokens = daily_tokens * 30
    monthly_cost = monthly_tokens * price_per_million / 1_000_000

    return {
        "daily_tokens": daily_tokens,
        "monthly_tokens": monthly_tokens,
        "monthly_cost": monthly_cost
    }

# Example: 1000 requests/day, 500 tokens each, $0.10/M
estimate = estimate_monthly_cost(1000, 500, 0.10)
print(f"Monthly cost: ${estimate['monthly_cost']:.2f}")  # $1.50

Token Limits

Context Window

Each model has a maximum context window:
Model         | Context Window
llama-3.1-8b  | 128,000 tokens
llama-3.1-70b | 128,000 tokens
mistral-7b    | 32,000 tokens
phi-3-mini    | 4,000 tokens
If your input exceeds the context window, the request will fail with a context_length_exceeded error.
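
Because local counts are only estimates, it helps to leave headroom when checking input size against the window. A minimal pre-flight check, reusing count_message_tokens from above (the CONTEXT_WINDOWS mapping and fits_in_context helper are illustrative, not part of the SDK):
# Context sizes from the table above; adjust if your account differs
CONTEXT_WINDOWS = {
    "llama-3.1-8b": 128_000,
    "llama-3.1-70b": 128_000,
    "mistral-7b": 32_000,
    "phi-3-mini": 4_000,
}

def fits_in_context(messages, model, max_output_tokens=500):
    # Reserve room for the response as well as the prompt
    estimated_input = count_message_tokens(messages)
    return estimated_input + max_output_tokens <= CONTEXT_WINDOWS[model]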

Output Limits

You can limit output tokens with max_tokens:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write an essay"}],
    max_tokens=500  # Limit response length
)
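
When the model hits max_tokens before it finishes, OpenAI-compatible SDKs typically report this through finish_reason. Assuming the Assisters API follows that convention, you can detect truncated output like this:
choice = response.choices[0]
if choice.finish_reason == "length":
    # The response was cut off at the max_tokens cap
    print("Warning: response was truncated")
print(choice.message.content)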

Optimizing Token Usage

1. Trim Conversation History

Keep only recent messages:
def trim_messages(messages, max_tokens=4000):
    encoding = tiktoken.get_encoding("cl100k_base")

    # Always keep the system message
    system = []
    if messages and messages[0]["role"] == "system":
        system = [messages[0]]
        messages = messages[1:]

    current_tokens = count_message_tokens(system)

    # Walk backwards from the most recent message; stop when the budget is hit
    kept = []
    for msg in reversed(messages):
        msg_tokens = len(encoding.encode(msg["content"])) + 4
        if current_tokens + msg_tokens > max_tokens:
            break
        kept.append(msg)
        current_tokens += msg_tokens

    # Restore chronological order
    return system + list(reversed(kept))
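
For example, trimming a running conversation before each request keeps the prompt within budget (conversation_history here is a hypothetical list of prior messages):
trimmed = trim_messages(conversation_history, max_tokens=4000)
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=trimmed
)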

2. Summarize Long Contexts

def summarize_if_long(text, max_tokens=2000):
    tokens = count_tokens(text)

    if tokens <= max_tokens:
        return text

    # Summarize with AI
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[
            {"role": "system", "content": "Summarize concisely:"},
            {"role": "user", "content": text}
        ],
        max_tokens=max_tokens // 2
    )

    return response.choices[0].message.content

3. Use Efficient Prompts

# Verbose (more tokens)
messages = [{
    "role": "user",
    "content": "I would like you to please provide me with a comprehensive and detailed explanation of what machine learning is and how it works in general terms."
}]

# Concise (fewer tokens)
messages = [{
    "role": "user",
    "content": "Explain machine learning briefly."
}]
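
You can verify the savings with the count_tokens helper from earlier; exact counts depend on the tokenizer, so treat these as estimates:
verbose = ("I would like you to please provide me with a comprehensive and "
           "detailed explanation of what machine learning is and how it "
           "works in general terms.")
concise = "Explain machine learning briefly."

print(count_tokens(verbose))  # ~30 tokens (estimate)
print(count_tokens(concise))  # ~5 tokens (estimate)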

4. Cache Responses

Don't re-request the same information. functools.lru_cache can't be applied directly to message lists (dicts aren't hashable), so a small in-memory dictionary keyed by a hash of the messages works instead:
import hashlib
import json

# Simple in-memory cache keyed by a hash of the messages
_completion_cache = {}

def get_completion(messages):
    # Build a stable cache key from the messages
    key = hashlib.md5(json.dumps(messages, sort_keys=True).encode()).hexdigest()

    if key not in _completion_cache:
        _completion_cache[key] = client.chat.completions.create(
            model="llama-3.1-8b",
            messages=messages
        )

    return _completion_cache[key]
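
Repeated calls with identical messages are then served from memory instead of the API:
messages = [{"role": "user", "content": "What is a token?"}]
first = get_completion(messages)   # makes an API call
second = get_completion(messages)  # returned from the cache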

Tracking Usage

Per-Request Tracking

class UsageTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0
        self.request_count = 0

    def record(self, usage, price_per_million):
        tokens = usage.total_tokens
        cost = tokens * price_per_million / 1_000_000

        self.total_tokens += tokens
        self.total_cost += cost
        self.request_count += 1

    def report(self):
        return {
            "requests": self.request_count,
            "tokens": self.total_tokens,
            "cost": f"${self.total_cost:.4f}"
        }

# Usage
tracker = UsageTracker()

response = client.chat.completions.create(...)
tracker.record(response.usage, price_per_million=0.10)

print(tracker.report())

Dashboard Monitoring

Check your usage in real time at assisters.dev/dashboard/usage.

Token Pricing Reference

Model           | Price per Million Tokens
llama-3.1-8b    | $0.10
llama-3.1-70b   | $0.90
mistral-7b      | $0.10
e5-large-v2     | $0.01
llama-guard-3   | $0.20
bge-reranker-v2 | $0.05

Full Pricing Details

See complete pricing for all models and tiers.