Tokens Explained

Learn what tokens are, why they matter for billing and performance, and how to optimize token usage in your AI applications.

What is a Token?

A token is a small chunk of text that a language model processes. Tokens can be:

  • Whole words: "hello" = 1 token

  • Parts of words: "running" might be 2 tokens ("run" + "ning")

  • Single characters: "!" = 1 token

  • Spaces and punctuation: " " or "," = 1 token

Examples

"Hello, world!" ≈ 4 tokens
["Hello", ",", " world", "!"]

"The quick brown fox" ≈ 5 tokens
["The", " quick", " brown", " fox"]

"Artificial Intelligence" ≈ 3 tokens
["Art", "ificial", " Intelligence"]

"I'm coding in JavaScript" ≈ 6 tokens
["I", "'m", " coding", " in", " JavaScript"]

Rule of Thumb

  • English: ~1 token ≈ 4 characters or 0.75 words

  • Code: ~1 token ≈ 3-4 characters

  • Other languages: May use more tokens per word

Quick estimate: 100 words ≈ 133 tokens
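
For instance, a rough estimator based on these ratios might look like this (a heuristic sketch only; for exact counts use a real tokenizer such as tiktoken):

// Rough heuristic: ~4 characters per token in English text.
// Actual counts depend on the model's tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("The quick brown fox")); // 5 (actual count: 4)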

Why Tokens Matter

1. Billing is Based on Tokens

Every API request costs money based on the number of tokens:

Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)

Example:

Request: "Summarize this article..." (50 tokens input)
Response: "This article discusses..." (100 tokens output)

Using Standard profile:
- Input: 50 × $0.25/M = $0.0000125
- Output: 100 × $2.50/M = $0.00025
Total: $0.0002625 (0.26 credits)
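
The same calculation in code might look like this (a sketch with the Standard profile prices hardcoded from the pricing table further below; always check the Profile Dashboard for current rates):

// Standard profile: $0.25 per million input tokens, $2.50 per million output tokens.
function requestCostUSD(inputTokens, outputTokens) {
  const INPUT_PRICE = 0.25;  // $/M input
  const OUTPUT_PRICE = 2.5;  // $/M output
  return (inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE) / 1_000_000;
}

console.log(requestCostUSD(50, 100)); // 0.0002625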

2. Context Limits are in Tokens

Each model profile has a maximum token capacity:

Profile      Max Context Size
LITE         100,000 tokens
STANDARD     200,000 tokens
DEEPTHINK    400,000 tokens
DEV          10,000 tokens

The context includes:

  • System messages

  • Conversation history

  • User prompt

  • AI response

If you exceed the limit, the API returns an error.
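
A minimal pre-flight check could catch this before sending (a sketch; the limits map mirrors the table above, and estimateTokens is the heuristic sketch from earlier):

// Context limits per profile, from the table above.
const CONTEXT_LIMITS = {
  lite: 100_000,
  standard: 200_000,
  deepthink: 400_000,
  dev: 10_000,
};

function fitsContext(profile, messages, maxTokens) {
  // Estimate the input side, then reserve room for the response.
  const inputEstimate = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content), 0
  );
  return inputEstimate + maxTokens <= CONTEXT_LIMITS[profile];
}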

3. Response Time Depends on Tokens

More tokens = longer processing time:

  • Input tokens: Time to understand your request

  • Output tokens: Time to generate the response

Tip: Limit max_tokens for faster responses when you don't need long outputs.

Input vs Output Tokens

Input Tokens

Everything you send to the API:

  • System messages

  • User messages

  • Assistant messages (conversation history)

  • Function/tool definitions

const completion = await client.chat.completions.create({
  model: "standard",
  messages: [
    { role: "system", content: "You are helpful." }, // Input
    { role: "user", content: "What is the capital of France?" }, // Input
  ],
});

Output Tokens

Everything the AI generates:

  • Assistant's response content

Note: Reasoning tokens (for o1-style models) are generated internally but billed as input tokens, not output. See the Reasoning Tokens section below.

// The response content counts as output tokens
console.log(completion.choices[0].message.content);
// "The capital of France is Paris."

Pricing Difference

Output tokens cost significantly more than input tokens:

Profile      Input Price   Output Price   Ratio
lite         $0.05/M       $0.50/M        10x
standard     $0.25/M       $2.50/M        10x
deepthink    $1.25/M       $12.50/M       10x

For the most up-to-date pricing, please refer to the Profile Dashboard (requires login).

Implication: Long outputs cost more than long inputs.

Reasoning Tokens (Thinking Tokens)

Some advanced AI models use reasoning tokens (also called "thinking tokens") during their internal processing before generating the final response.

What Are Reasoning Tokens?

Reasoning tokens represent the model's internal "thinking" process:

  • Internal computation: The model reasons through the problem step-by-step

  • Not visible: These tokens don't appear in the final response

  • Counted as input: Added to your input token count for billing purposes

  • Quality improvement: Help the model provide better, more accurate responses

Which Models Use Reasoning Tokens?

Primarily advanced reasoning models like:

  • OpenAI's GPT-5 series

  • Other models with chain-of-thought capabilities

In Halfred, this primarily affects the DEEPTHINK profile.

Key Points

  • Transparent billing: Reasoning tokens are always included in the prompt_tokens count

  • Variable usage: Complex questions may generate more reasoning tokens

  • Value proposition: You pay for the improved reasoning quality

  • Profile-specific: Most common in DEEPTHINK, rare in LITE/STANDARD

Monitoring Reasoning Tokens

Check the usage object in API responses:

console.log(`Prompt tokens: ${completion.usage.prompt_tokens}`);
console.log(`Completion tokens: ${completion.usage.completion_tokens}`);

// If prompt_tokens is much higher than your input length,
// the model used reasoning tokens
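
For example, comparing the reported count against your own local estimate gives a rough view of the overhead (a sketch; localEstimate is assumed to come from a tokenizer such as tiktoken):

// localEstimate: your own token count of the messages you actually sent.
const reasoningOverhead = completion.usage.prompt_tokens - localEstimate;
if (reasoningOverhead > 0) {
  console.log(`~${reasoningOverhead} reasoning tokens billed as input`);
}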

💡 Tip: For cost-sensitive applications, consider using STANDARD or LITE profiles which typically don't use reasoning tokens.

How Halfred Counts Tokens

Understanding how Halfred counts tokens ensures accurate billing and transparency.

Token Counting Methods

Halfred uses two different approaches depending on the stage of your request:

1. Estimation (Before Request)

For pre-request estimates, Halfred uses the tiktoken library, which implements the industry-standard tokenization method used by most AI providers. This allows you to:

  • Plan API calls and estimate costs

  • Validate that content fits within context limits

  • Budget your token usage in advance
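
For example, with the js-tiktoken package (one JavaScript port of tiktoken; the package name and encoding below are assumptions, so check which encoding matches your model):

import { getEncoding } from "js-tiktoken";

// cl100k_base is a widely used encoding; the right choice depends on the model.
const enc = getEncoding("cl100k_base");

const text = "What is the capital of France?";
console.log(`Estimated tokens: ${enc.encode(text).length}`);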

2. Actual Billing (After Request)

For billing purposes, Halfred prioritizes the token count returned by the model provider itself:

  • Primary Method: Halfred uses the exact token count reported by the model provider (OpenAI, Anthropic, Google, etc.) in their API response

  • Fallback Method (Rare): If the provider doesn't return a token count, Halfred applies the same tiktoken-based estimation method

Why Use Provider Counts?

Using the provider's actual token count ensures:

  • Accuracy: Reflects the exact tokens processed by the specific model

  • Consistency: Matches how the underlying provider bills

  • Transparency: You see the same counts the provider reports

  • No markup: Token counts are passed through directly with no added charges

Billing Transparency

Every API response includes a usage object showing:

  • prompt_tokens: Input tokens (including reasoning tokens if applicable)

  • completion_tokens: Output tokens generated

  • total_tokens: Sum of input and output tokens

These are the exact numbers used to calculate your bill. All requests are logged with their token counts in your dashboard for a full audit trail.
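
An illustrative usage object (the values here are made up):

// Example shape of completion.usage
{
  "prompt_tokens": 57,
  "completion_tokens": 104,
  "total_tokens": 161
}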

💡 Best Practice: Always check the usage field in responses to track your actual token consumption and costs.

Optimizing Token Usage

1. Write Concise Prompts

// ❌ Verbose (~20 tokens)
"I would really appreciate it if you could help me by providing a detailed explanation of how artificial intelligence works";

// ✅ Concise (~6 tokens)
"Explain how artificial intelligence works";

2. Limit System Messages

// ❌ Long system message
{
  role: "system",
  content: "You are a highly knowledgeable assistant who always provides detailed, comprehensive, and well-researched answers..."
}

// ✅ Concise system message
{
  role: "system",
  content: "You are a knowledgeable assistant."
}

3. Use max_tokens

Control output length:

await client.chat.completions.create({
  model: "standard",
  messages: [...],
  max_tokens: 100  // Limit to 100 output tokens
});

4. Avoid Unnecessary Conversation History

// ❌ Sending entire history every time
const allMessages = [
  /* 100 previous messages */
];

// ✅ Send only relevant recent messages
const relevantMessages = allMessages.slice(-10);

5. Choose Appropriate Profile

Don't use a large context if you don't need it:

// ❌ Overkill for short prompts
model: "deepthink"; // 400K context for a 10-token prompt

// ✅ Appropriate for the task
model: "lite"; // 100K context is more than enough

Special Token Considerations

Code

Code typically uses more tokens than natural language:

// Natural language: ~8 tokens
"Create a function that adds two numbers";

// Code: ~15 tokens
function add(a, b) {
  return a + b;
}

JSON

JSON structure adds extra tokens:

{
  "name": "John",
  "age": 30
}

This is ~10 tokens due to brackets, quotes, and formatting.
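
If you send JSON to the model, skipping pretty-printing is an easy saving, since indentation and newlines consume tokens too (a minimal sketch):

const data = { name: "John", age: 30 };

// Pretty-printed: extra tokens for newlines and indentation
JSON.stringify(data, null, 2);

// Minified: same content, fewer characters, typically fewer tokens
JSON.stringify(data); // {"name":"John","age":30}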

Different Languages

Non-English languages may use more tokens:

  • English: ~133 tokens per 100 words

  • Spanish: ~140 tokens per 100 words

  • Chinese: ~170 tokens per 100 words

  • Arabic: ~180 tokens per 100 words

Monitoring Token Usage

Track Per-Request

const completion = await client.chat.completions.create({
  model: "standard",
  messages: [...]
});

const { prompt_tokens, completion_tokens, total_tokens } = completion.usage;

console.log(`Input: ${prompt_tokens} tokens`);
console.log(`Output: ${completion_tokens} tokens`);
console.log(`Total: ${total_tokens} tokens`);

const cost = (prompt_tokens * 0.25 + completion_tokens * 2.50) / 1_000_000; // standard profile rates
console.log(`Cost: $${cost.toFixed(6)}`);

Common Token Errors

Error: Context Length Exceeded

Error: This model's maximum context length is 200000 tokens.
However, your messages resulted in 250000 tokens.

Solutions:

  1. Reduce message history (see the sketch after this list)

  2. Shorten your prompt

  3. Use a profile with larger context (deepthink)

  4. Summarize earlier conversation
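
A minimal sketch of option 1, keeping the system message plus only the most recent messages (the cutoff of 10 is arbitrary; allMessages is the full history as in the earlier example):

// Preserve the system message, drop older conversation turns.
const systemMessage = allMessages.find((m) => m.role === "system");
const recent = allMessages.filter((m) => m.role !== "system").slice(-10);
const trimmed = systemMessage ? [systemMessage, ...recent] : recent;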

Error: max_tokens Too Large

Error: max_tokens value exceeds available context

Solution: Reduce max_tokens or shorten input
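
One way to pick a safe value is to subtract your estimated input from the profile's context limit (a sketch reusing the estimateTokens helper and CONTEXT_LIMITS map from earlier; messages is your request payload):

// Whatever room the input leaves in the context window can go to the output.
const inputEstimate = messages.reduce(
  (sum, m) => sum + estimateTokens(m.content), 0
);
const safeMaxTokens = Math.max(0, CONTEXT_LIMITS["standard"] - inputEstimate);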

Frequently Asked Questions

How can I reduce token costs?

  1. Write concise prompts

  2. Limit conversation history

  3. Use max_tokens to control output

  4. Choose the right profile for the task

  5. Cache common responses

Do emojis count as tokens?

Yes! Emojis typically count as 1-2 tokens each (e.g., 😀).

Are tokens counted before or after processing?

Tokens are estimated before the request is sent, which lets you stay within context limits. Final billing, however, uses the token counts the model provider reports after processing (see How Halfred Counts Tokens above).

Can I see token counts before making a request?

Yes, use tokenizer libraries (like tiktoken) to count tokens beforehand.

Do system messages count toward the limit?

Yes, everything counts: system messages, user messages, assistant messages, and the generated response.

What happens if I hit the token limit mid-response?

The response will be truncated, and finish_reason will be "length" instead of "stop".
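
You can detect this in code and react, for example by retrying with a higher max_tokens (a short sketch):

const choice = completion.choices[0];
if (choice.finish_reason === "length") {
  // The output was cut off at the max_tokens limit.
  console.warn("Response truncated; consider raising max_tokens");
}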
