Token Counting
Tokens are the basic unit of text processing and billing in the Assisters API. Understanding tokens helps you optimize costs and manage rate limits.
What Are Tokens?
Tokens are pieces of words used by AI models. A rough rule of thumb:
1 token ≈ 4 characters in English
1 token ≈ 0.75 words in English
100 tokens ≈ 75 words
The exact tokenization depends on the model. Different models may tokenize the same text differently.
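For a quick, model-agnostic estimate you can apply the rule of thumb directly. A rough sketch (use tiktoken, shown further below, when you need a closer estimate):

def rough_token_estimate(text):
    # Rule of thumb: roughly 4 characters per token in English
    return max(1, len(text) // 4)

print(rough_token_estimate("Hello, world!"))  # 3 (quick estimate only)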
Examples
Text                          Tokens
"Hello"                       1
"Hello, world!"               4
"The quick brown fox"         4
"Artificial Intelligence"     2-3
"こんにちは" (Japanese)       3-5
Counting Tokens
In API Responses
Every response includes token usage:
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
Field               Description
prompt_tokens       Tokens in your input (messages)
completion_tokens   Tokens in the model's response
total_tokens        Total tokens billed
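You can read these fields directly from the response object. A minimal sketch, assuming an OpenAI-compatible Python client that exposes usage as attributes:

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Log token usage for this request
usage = response.usage
print(f"Prompt: {usage.prompt_tokens}, "
      f"Completion: {usage.completion_tokens}, "
      f"Total: {usage.total_tokens}")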
Before Sending (Estimation)
Use tiktoken to estimate tokens before making requests:
import tiktoken

def count_tokens(text, model="llama-3.1-8b"):
    # Use cl100k_base encoding (similar to Llama)
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Example
text = "What is the meaning of life?"
tokens = count_tokens(text)
print(f"Estimated tokens: {tokens}")  # ~7 tokens
Counting Message Tokens
Messages have overhead beyond just the content:
def count_message_tokens(messages):
    encoding = tiktoken.get_encoding("cl100k_base")
    total = 0
    for message in messages:
        # Message overhead (role, formatting)
        total += 4
        # Content tokens
        total += len(encoding.encode(message.get("content", "")))
        # Name (if present)
        if "name" in message:
            total += len(encoding.encode(message["name"]))
            total += 1  # Name field overhead
    # Conversation overhead
    total += 2
    return total

# Example
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]
print(f"Total: {count_message_tokens(messages)} tokens")
Billing Calculation
You’re billed for total tokens (input + output):
Cost = (prompt_tokens + completion_tokens) × price_per_million / 1,000,000
Example Calculation
# Llama 3.1 8B pricing: $0.10 per million tokens
prompt_tokens = 500
completion_tokens = 1000
total_tokens = 1500

price_per_million = 0.10
cost = total_tokens * price_per_million / 1_000_000
print(f"Cost: ${cost:.6f}")  # $0.000150
Monthly Usage Estimation
def estimate_monthly_cost(requests_per_day, avg_tokens_per_request, price_per_million):
    daily_tokens = requests_per_day * avg_tokens_per_request
    monthly_tokens = daily_tokens * 30
    monthly_cost = monthly_tokens * price_per_million / 1_000_000
    return {
        "daily_tokens": daily_tokens,
        "monthly_tokens": monthly_tokens,
        "monthly_cost": monthly_cost
    }

# Example: 1000 requests/day, 500 tokens each, $0.10/M
estimate = estimate_monthly_cost(1000, 500, 0.10)
print(f"Monthly cost: ${estimate['monthly_cost']:.2f}")  # $1.50
Token Limits
Context Window
Each model has a maximum context window:
Model           Context Window
llama-3.1-8b    128,000 tokens
llama-3.1-70b   128,000 tokens
mistral-7b      32,000 tokens
phi-3-mini      4,000 tokens
If your input exceeds the context window, the request will fail with a context_length_exceeded error.
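If you want to handle this failure in code, a rough sketch follows. The exact exception class depends on your client library, so this checks the error message as a fallback (an assumption, not documented behavior):

try:
    response = client.chat.completions.create(
        model="phi-3-mini",
        messages=messages  # may be too long for a 4,000-token window
    )
except Exception as e:
    if "context_length_exceeded" in str(e):
        # Trim or summarize the input and retry (see "Optimizing Token Usage" below)
        ...
    else:
        raise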
Output Limits
You can limit output tokens with max_tokens:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write an essay"}],
    max_tokens=500  # Limit response length
)
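When a reply is cut off by max_tokens, OpenAI-compatible clients typically report it via finish_reason. A small sketch, assuming the Assisters API follows that convention:

choice = response.choices[0]
if choice.finish_reason == "length":
    # Truncated at max_tokens; raise the limit or ask the model to continue
    print("Response was truncated by max_tokens")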
Optimizing Token Usage
1. Trim Conversation History
Keep only recent messages:
def trim_messages(messages, max_tokens=4000):
    encoding = tiktoken.get_encoding("cl100k_base")

    # Always keep the system message
    result = []
    if messages and messages[0]["role"] == "system":
        result.append(messages[0])
        messages = messages[1:]

    current_tokens = count_message_tokens(result)

    # Add messages from most recent, stop when the limit is reached
    for msg in reversed(messages):
        msg_tokens = len(encoding.encode(msg["content"])) + 4
        if current_tokens + msg_tokens > max_tokens:
            break
        # Insert just after the system message so chronological order is preserved
        result.insert(len([m for m in result if m["role"] == "system"]), msg)
        current_tokens += msg_tokens

    return result
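For example (a quick sketch; conversation_history stands in for your own running message list):

# Keep the system prompt plus as much recent history as fits in ~1,000 tokens
trimmed = trim_messages(conversation_history, max_tokens=1000)
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=trimmed
)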
2. Summarize Long Contexts
def summarize_if_long(text, max_tokens=2000):
    tokens = count_tokens(text)
    if tokens <= max_tokens:
        return text

    # Summarize with AI
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[
            {"role": "system", "content": "Summarize concisely:"},
            {"role": "user", "content": text}
        ],
        max_tokens=max_tokens // 2
    )
    return response.choices[0].message.content
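For example, you might shrink a long document before asking questions about it (a sketch; long_document is a placeholder for your own text):

context = summarize_if_long(long_document, max_tokens=2000)
messages = [
    {"role": "system", "content": f"Answer using this context:\n{context}"},
    {"role": "user", "content": "What are the key findings?"}
]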
3. Use Efficient Prompts
# Verbose (more tokens)
messages = [{
    "role": "user",
    "content": "I would like you to please provide me with a comprehensive and detailed explanation of what machine learning is and how it works in general terms."
}]

# Concise (fewer tokens)
messages = [{
    "role": "user",
    "content": "Explain machine learning briefly."
}]
4. Cache Responses
Don’t re-request the same information:
import json
from functools import lru_cache

@lru_cache(maxsize=1000)
def _cached_completion(messages_json):
    # The cache key is the serialized messages, so the request can be rebuilt here
    messages = json.loads(messages_json)
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=messages
    )
    return response.choices[0].message.content

def get_completion(messages):
    # Serialize deterministically so identical conversations share a cache entry
    messages_json = json.dumps(messages, sort_keys=True)
    return _cached_completion(messages_json)
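Note that lru_cache is in-memory and per-process, so cached responses are lost on restart and not shared across workers; for multi-server deployments, consider an external cache such as Redis.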
Tracking Usage
Per-Request Tracking
class UsageTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0.0
        self.request_count = 0

    def record(self, usage, price_per_million):
        tokens = usage.total_tokens
        cost = tokens * price_per_million / 1_000_000
        self.total_tokens += tokens
        self.total_cost += cost
        self.request_count += 1

    def report(self):
        return {
            "requests": self.request_count,
            "tokens": self.total_tokens,
            "cost": f"${self.total_cost:.4f}"
        }

# Usage
tracker = UsageTracker()

response = client.chat.completions.create(...)
tracker.record(response.usage, price_per_million=0.10)

print(tracker.report())
Dashboard Monitoring
Check your usage in real time at assisters.dev/dashboard/usage.
Token Pricing Reference
Model             Price per Million Tokens
llama-3.1-8b      $0.10
llama-3.1-70b     $0.90
mistral-7b        $0.10
e5-large-v2       $0.01
llama-guard-3     $0.20
bge-reranker-v2   $0.05
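To estimate the cost of a single request for a specific model, you can pair this table with the usage data from a response. A quick sketch with the prices hardcoded from the table above:

# Dollars per million tokens, copied from the pricing table
PRICES = {
    "llama-3.1-8b": 0.10,
    "llama-3.1-70b": 0.90,
    "mistral-7b": 0.10,
    "e5-large-v2": 0.01,
    "llama-guard-3": 0.20,
    "bge-reranker-v2": 0.05,
}

def request_cost(model, usage):
    # Cost in dollars for one request, given the response's usage object
    return usage.total_tokens * PRICES[model] / 1_000_000

# Example: 1,500 total tokens on llama-3.1-70b -> $0.001350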
Full Pricing Details: See complete pricing for all models and tiers.