# Tokens Explained

## What is a Token?

A **token** is a small chunk of text that a language model processes. Tokens can be:

* Whole words: `"hello"` = 1 token
* Parts of words: `"running"` may be split into 2 tokens (`"run"` + `"ning"`)
* Single characters: `"!"` = 1 token
* Spaces and punctuation: `" "` or `","` = 1 token

### Examples

```
"Hello, world!" ≈ 4 tokens
["Hello", ",", " world", "!"]

"The quick brown fox" ≈ 4 tokens
["The", " quick", " brown", " fox"]

"Artificial Intelligence" ≈ 3 tokens
["Art", "ificial", " Intelligence"]

"I'm coding in JavaScript" ≈ 5 tokens
["I", "'m", " coding", " in", " JavaScript"]
```

### Rule of Thumb

* **English**: 1 token ≈ 4 characters or 0.75 words
* **Code**: 1 token ≈ 3-4 characters
* **Other languages**: May use more tokens per word

**Quick estimate**: 100 words ≈ 133 tokens
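The rule of thumb above can be turned into a rough estimator. This is a minimal sketch, not a real tokenizer: `estimateTokensFromChars` and `estimateTokensFromWords` are hypothetical helper names, and the ratios are the English approximations listed above, so actual counts will differ.

```javascript
// Rough English-only estimates based on the rule of thumb above.
// These are approximations, not tokenizer output.
const CHARS_PER_TOKEN = 4; // ~4 characters per token
const WORDS_PER_TOKEN = 0.75; // ~0.75 words per token

function estimateTokensFromChars(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function estimateTokensFromWords(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  return Math.round(words.length / WORDS_PER_TOKEN);
}

const sample = "The quick brown fox jumps over the lazy dog";
console.log(estimateTokensFromChars(sample)); // 43 characters -> 11
console.log(estimateTokensFromWords(sample)); // 9 words -> 12
```

The two estimates rarely agree exactly. Taking the larger one is the safer choice when checking a limit.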

## Why Tokens Matter

### 1. Billing is Based on Tokens

Every API request costs money based on the number of tokens:

```
Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
```

**Example**:

```
Request: "Summarize this article..." (50 tokens input)
Response: "This article discusses..." (100 tokens output)

Using Standard profile:
- Input: 50 × $0.25/M = $0.0000125
- Output: 100 × $2.50/M = $0.00025
Total: $0.0002625 (0.26 credits)
```
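The same arithmetic can be written as a small function. This is a sketch: `requestCost` is a hypothetical helper, and the prices are passed in explicitly (in USD per million tokens) rather than looked up from Halfred.

```javascript
// Prices are in USD per million tokens.
function requestCost(inputTokens, outputTokens, inputPricePerM, outputPricePerM) {
  const inputCost = (inputTokens * inputPricePerM) / 1_000_000;
  const outputCost = (outputTokens * outputPricePerM) / 1_000_000;
  return inputCost + outputCost;
}

// The example above: 50 input and 100 output tokens at $0.25/M and $2.50/M.
const exampleCost = requestCost(50, 100, 0.25, 2.5);
console.log(`$${exampleCost.toFixed(7)}`); // $0.0002625
```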

### 2. Context Limits are in Tokens

Each model profile has a maximum token capacity:

| Profile       | Max Context Size |
| ------------- | ---------------- |
| **LITE**      | 100,000 tokens   |
| **STANDARD**  | 200,000 tokens   |
| **DEEPTHINK** | 400,000 tokens   |
| **DEV**       | 10,000 tokens    |

The context includes:

* System messages
* Conversation history
* User prompt
* AI response

**If you exceed the limit**, the API returns an error.
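A pre-flight check can catch oversized requests before they reach the API. The sketch below copies the limits from the table above. `fitsInContext` is a hypothetical helper, the lowercase profile keys (including `dev`) are an assumption, and the token counts you pass in would come from your own estimate.

```javascript
// Context limits from the table above, in tokens.
const CONTEXT_LIMITS = {
  lite: 100_000,
  standard: 200_000,
  deepthink: 400_000,
  dev: 10_000,
};

// The prompt and the reserved output budget must fit in the window together.
function fitsInContext(profile, promptTokens, maxOutputTokens) {
  const limit = CONTEXT_LIMITS[profile];
  if (limit === undefined) {
    throw new Error(`Unknown profile: ${profile}`);
  }
  return promptTokens + maxOutputTokens <= limit;
}

console.log(fitsInContext("standard", 150_000, 4_000)); // true
console.log(fitsInContext("dev", 9_500, 1_000)); // false
```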

### 3. Response Time Depends on Tokens

More tokens = longer processing time:

* **Input tokens**: Time to understand your request
* **Output tokens**: Time to generate the response

**Tip**: Limit `max_tokens` for faster responses when you don't need long outputs.

## Input vs Output Tokens

### Input Tokens

Everything you **send** to the API:

* System messages
* User messages
* Assistant messages (conversation history)
* Function/tool definitions

```javascript
// Assumes `client` is an already-initialized API client (see the Quick Start Guide)
const completion = await client.chat.completions.create({
  model: "standard",
  messages: [
    { role: "system", content: "You are helpful." }, // Input
    { role: "user", content: "What is the capital of France?" }, // Input
  ],
});
```

### Output Tokens

Everything the AI **generates**:

* Assistant's response content
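* Reasoning tokens (for reasoning models) are generated here too, but Halfred reports them under `prompt_tokens`; see "Reasoning Tokens (Thinking Tokens)" below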

```javascript
// The response content counts as output tokens
console.log(completion.choices[0].message.content);
// "The capital of France is Paris."
```

### Pricing Difference

Output tokens cost significantly more than input tokens:

| Profile   | Input Price | Output Price | Ratio |
| --------- | ----------- | ------------ | ----- |
| lite      | $0.05/M     | $0.50/M      | 10x   |
| standard  | $0.25/M     | $2.50/M      | 10x   |
| deepthink | $1.25/M     | $12.50/M     | 10x   |

*For the most up-to-date pricing, please refer to the* [*Profile Dashboard*](https://halfred.ai/app/project-profile) *(requires login).*

**Implication**: Long outputs cost more than long inputs.
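To make the implication concrete, the sketch below prices the same total number of tokens split two ways. The `PRICES` object copies the table above (which may be out of date), and `costFor` is a hypothetical helper.

```javascript
// Prices from the table above, in USD per million tokens.
const PRICES = {
  lite: { input: 0.05, output: 0.5 },
  standard: { input: 0.25, output: 2.5 },
  deepthink: { input: 1.25, output: 12.5 },
};

function costFor(profile, inputTokens, outputTokens) {
  const price = PRICES[profile];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Same total of 11,000 tokens, split two ways on the standard profile.
const longInput = costFor("standard", 10_000, 1_000); // about $0.005
const longOutput = costFor("standard", 1_000, 10_000); // about $0.02525
console.log(longOutput > longInput); // true
```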

## Reasoning Tokens (Thinking Tokens)

Some advanced AI models use **reasoning tokens** (also called "thinking tokens") during their internal processing before generating the final response.

### What Are Reasoning Tokens?

Reasoning tokens represent the model's internal "thinking" process:

* **Internal computation**: The model reasons through the problem step-by-step
* **Not visible**: These tokens don't appear in the final response
* **Counted as input**: Added to your input token count for billing purposes
* **Quality improvement**: Help the model provide better, more accurate responses

### Which Models Use Reasoning Tokens?

Primarily advanced reasoning models like:

* OpenAI's GPT-5 series
* Other models with chain-of-thought capabilities

In Halfred, this primarily affects the **DEEPTHINK** profile.

### Key Points

* **Transparent billing**: Reasoning tokens are always included in the `prompt_tokens` count
* **Variable usage**: Complex questions may generate more reasoning tokens
* **Value proposition**: You pay for the improved reasoning quality
* **Profile-specific**: Most common in DEEPTHINK, rare in LITE/STANDARD

### Monitoring Reasoning Tokens

Check the `usage` object in API responses:

```javascript
console.log(`Prompt tokens: ${completion.usage.prompt_tokens}`);
console.log(`Completion tokens: ${completion.usage.completion_tokens}`);

// If prompt_tokens is much higher than your input length,
// the model used reasoning tokens
```
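The comparison described in the comment above can be automated with a simple heuristic. This is only a sketch: `likelyUsedReasoning` is a hypothetical helper, the 1.5x threshold is an arbitrary assumption, and the local estimate would come from your own tokenizer or rule of thumb.

```javascript
// Hypothetical helper: flags responses where the reported prompt_tokens is
// much larger than a local estimate of the input. The 1.5x threshold is an
// arbitrary assumption, not a Halfred rule.
function likelyUsedReasoning(usage, estimatedInputTokens, threshold = 1.5) {
  if (estimatedInputTokens <= 0) return false;
  return usage.prompt_tokens / estimatedInputTokens > threshold;
}

// Mock usage objects shaped like the `usage` field shown above.
console.log(likelyUsedReasoning({ prompt_tokens: 900, completion_tokens: 120 }, 200)); // true
console.log(likelyUsedReasoning({ prompt_tokens: 210, completion_tokens: 120 }, 200)); // false
```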

💡 **Tip**: For cost-sensitive applications, consider using STANDARD or LITE profiles which typically don't use reasoning tokens.

## How Halfred Counts Tokens

Understanding how Halfred counts tokens ensures accurate billing and transparency.

### Token Counting Methods

Halfred uses two different approaches depending on the stage of your request:

#### 1. Estimation (Before Request)

For pre-request estimates, Halfred uses the **tiktoken library**, OpenAI's open-source tokenizer. Other providers use their own tokenizers, so treat these estimates as approximate for non-OpenAI models. Estimation allows you to:

* Plan API calls and estimate costs
* Validate that content fits within context limits
* Budget your token usage in advance

#### 2. Actual Billing (After Request)

For billing purposes, **Halfred prioritizes the token count returned by the model provider itself**:

* **Primary Method**: Halfred uses the exact token count reported by the model provider (OpenAI, Anthropic, Google, etc.) in their API response
* **Fallback Method (Rare)**: If the provider doesn't return a token count, Halfred applies the same tiktoken-based estimation method
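The priority order described above can be sketched as follows. This illustrates the logic only and is not Halfred's actual implementation: `resolveTokenCounts` and the `source` field are hypothetical names.

```javascript
// Illustrative only: prefer provider-reported counts, fall back to an estimate.
function resolveTokenCounts(providerUsage, estimate) {
  const hasProviderCounts =
    providerUsage != null &&
    Number.isInteger(providerUsage.prompt_tokens) &&
    Number.isInteger(providerUsage.completion_tokens);

  const source = hasProviderCounts ? "provider" : "estimate";
  const counts = hasProviderCounts ? providerUsage : estimate;

  return {
    source,
    prompt_tokens: counts.prompt_tokens,
    completion_tokens: counts.completion_tokens,
    total_tokens: counts.prompt_tokens + counts.completion_tokens,
  };
}

const localEstimate = { prompt_tokens: 48, completion_tokens: 97 };
console.log(resolveTokenCounts({ prompt_tokens: 50, completion_tokens: 100 }, localEstimate).source); // provider
console.log(resolveTokenCounts(null, localEstimate).source); // estimate
```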

### Why Use Provider Counts?

Using the provider's actual token count ensures:

* **Accuracy**: Reflects the exact tokens processed by the specific model
* **Consistency**: Matches how the underlying provider bills
* **Transparency**: You see the same counts the provider reports
* **No markup**: Token counts are passed through directly with no added charges

### Billing Transparency

Every API response includes a `usage` object showing:

* **prompt\_tokens**: Input tokens (including reasoning tokens if applicable)
* **completion\_tokens**: Output tokens generated
* **total\_tokens**: Sum of input and output tokens

These are the exact numbers used to calculate your bill. All requests are logged with their token counts in your dashboard for a full audit trail.

💡 **Best Practice**: Always check the `usage` field in responses to track your actual token consumption and costs.

## Optimizing Token Usage

### 1. Write Concise Prompts

```javascript
// ❌ Verbose (~20 tokens)
"I would really appreciate it if you could help me by providing a detailed explanation of how artificial intelligence works";

// ✅ Concise (~5 tokens)
"Explain how artificial intelligence works";
```

### 2. Limit System Messages

```javascript
// ❌ Long system message
{
  role: "system",
  content: "You are a highly knowledgeable assistant who always provides detailed, comprehensive, and well-researched answers..."
}

// ✅ Concise system message
{
  role: "system",
  content: "You are a knowledgeable assistant."
}
```

### 3. Use max\_tokens

Control output length:

```javascript
await client.chat.completions.create({
  model: "standard",
  messages: [...],
  max_tokens: 100  // Limit to 100 output tokens
});
```

### 4. Avoid Unnecessary Conversation History

```javascript
// ❌ Sending entire history every time
const allMessages = [
  /* 100 previous messages */
];

// ✅ Send only relevant recent messages
const relevantMessages = allMessages.slice(-10);
```
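A plain `slice(-10)` also drops a system message that sits at the start of the history. The sketch below keeps system messages and trims only the rest. `trimHistory` is a hypothetical helper, and the default of 10 recent messages is an arbitrary assumption.

```javascript
// Keep system messages plus the most recent `keepLast` non-system messages.
function trimHistory(messages, keepLast = 10) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // slice(-0) would return everything, so handle keepLast <= 0 explicitly.
  const recent = keepLast > 0 ? rest.slice(-keepLast) : [];
  return [...system, ...recent];
}

// Mock history: one system message followed by 30 alternating turns.
const history = [{ role: "system", content: "You are helpful." }];
for (let i = 0; i < 30; i++) {
  history.push({ role: i % 2 === 0 ? "user" : "assistant", content: `Message ${i}` });
}

console.log(trimHistory(history).length); // 11
```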

### 5. Choose Appropriate Profile

Larger-context profiles also cost more per token (see Pricing Difference above), so pick the smallest profile that fits the task:

```javascript
// ❌ Overkill for short prompts
model: "deepthink"; // 400K context for a 10-token prompt

// ✅ Appropriate for the task
model: "lite"; // 100K context is more than enough
```

## Special Token Considerations

### Code

Code typically uses more tokens than natural language:

```javascript
// Natural language: ~7 tokens
"Create a function that adds two numbers";

// Code: ~15 tokens
function add(a, b) {
  return a + b;
}
```

### JSON

JSON structure adds extra tokens:

```json
{
  "name": "John",
  "age": 30
}
```

This is \~10 tokens due to brackets, quotes, and formatting.

### Different Languages

Non-English languages may use more tokens:

* English: \~133 tokens per 100 words
* Spanish: \~140 tokens per 100 words
* Chinese: \~170 tokens per 100 words
* Arabic: \~180 tokens per 100 words

## Monitoring Token Usage

### Track Per-Request

```javascript
const completion = await client.chat.completions.create({
  model: "standard",
  messages: [...]
});

const { prompt_tokens, completion_tokens, total_tokens } = completion.usage;

console.log(`Input: ${prompt_tokens} tokens`);
console.log(`Output: ${completion_tokens} tokens`);
console.log(`Total: ${total_tokens} tokens`);

// Standard profile prices: $0.25/M input, $2.50/M output
const cost = (prompt_tokens * 0.25 + completion_tokens * 2.50) / 1_000_000;
console.log(`Cost: $${cost.toFixed(6)}`);
```
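To track usage across several requests rather than one, the `usage` objects can be accumulated. This is a minimal in-memory sketch: `createUsageTracker` is a hypothetical helper and nothing here is persisted.

```javascript
// Accumulates the `usage` objects from multiple responses.
function createUsageTracker() {
  const totals = { requests: 0, prompt_tokens: 0, completion_tokens: 0 };
  return {
    record(usage) {
      totals.requests += 1;
      totals.prompt_tokens += usage.prompt_tokens;
      totals.completion_tokens += usage.completion_tokens;
    },
    summary() {
      return {
        ...totals,
        total_tokens: totals.prompt_tokens + totals.completion_tokens,
      };
    },
  };
}

const tracker = createUsageTracker();
tracker.record({ prompt_tokens: 50, completion_tokens: 100, total_tokens: 150 });
tracker.record({ prompt_tokens: 20, completion_tokens: 30, total_tokens: 50 });
console.log(tracker.summary());
// { requests: 2, prompt_tokens: 70, completion_tokens: 130, total_tokens: 200 }
```

In a real application you would call `tracker.record(completion.usage)` after each response.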

## Common Token Errors

### Error: Context Length Exceeded

```
Error: This model's maximum context length is 200000 tokens.
However, your messages resulted in 250000 tokens.
```

**Solutions**:

1. Reduce message history
2. Shorten your prompt
3. Use a profile with larger context (deepthink)
4. Summarize earlier conversation
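The first solution can be automated by dropping the oldest messages until the history fits a token budget. This is a sketch under assumptions: `fitToBudget` and `estimateTokens` are hypothetical helpers, message `content` is assumed to be a plain string, and the estimate uses the rough 4-characters-per-token rule rather than a real tokenizer.

```javascript
// Rough estimate (~4 characters per token); swap in a real tokenizer if available.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Drops the oldest non-system messages until the estimate fits the budget.
function fitToBudget(messages, budgetTokens) {
  const kept = [...messages];
  const total = () => kept.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  while (total() > budgetTokens) {
    const oldest = kept.findIndex((m) => m.role !== "system");
    if (oldest === -1) break; // only system messages left; nothing more to drop
    kept.splice(oldest, 1);
  }
  return kept;
}

const conversation = [
  { role: "system", content: "abcd" }, // ~1 token
  { role: "user", content: "a".repeat(40) }, // ~10 tokens
  { role: "assistant", content: "b".repeat(40) }, // ~10 tokens
  { role: "user", content: "c".repeat(8) }, // ~2 tokens
];

console.log(fitToBudget(conversation, 15).length); // 3
```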

### Error: max\_tokens Too Large

```
Error: max_tokens value exceeds available context
```

**Solution**: Reduce `max_tokens` or shorten input
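One way to avoid this error is to clamp `max_tokens` to the space left after the prompt. This is a sketch: `clampMaxTokens` is a hypothetical helper, and `promptTokens` would be your own estimate of the input size.

```javascript
// Largest max_tokens that still fits, given the prompt size and context limit.
function clampMaxTokens(requestedMaxTokens, promptTokens, contextLimit) {
  const available = Math.max(0, contextLimit - promptTokens);
  return Math.min(requestedMaxTokens, available);
}

console.log(clampMaxTokens(4_000, 9_000, 10_000)); // 1000
console.log(clampMaxTokens(500, 9_000, 10_000)); // 500
```

A result of `0` means the prompt alone already fills the context, so the input has to be shortened instead.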

## Frequently Asked Questions

### How can I reduce token costs?

1. Write concise prompts
2. Limit conversation history
3. Use `max_tokens` to control output
4. Choose the right profile for the task
5. Cache common responses

### Do emojis count as tokens?

Yes! Emojis typically count as 1-2 tokens each: 😀 = 1-2 tokens

### Are tokens counted before or after processing?

Both. Before a request is sent, token counts are estimated (with tiktoken) so you can stay within limits. After the response, billing uses the token count reported by the model provider.

### Can I see token counts before making a request?

Yes, use tokenizer libraries (like tiktoken) to count tokens beforehand.

### Do system messages count toward the limit?

Yes, everything counts: system messages, user messages, assistant messages, and the generated response.

### What happens if I hit the token limit mid-response?

The response will be truncated, and `finish_reason` will be `"length"` instead of `"stop"`.
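A truncated response can be detected in code. The sketch below assumes `finish_reason` sits on `choices[0]`, alongside the `message` field used in the earlier examples. `wasTruncated` is a hypothetical helper.

```javascript
// True when the first choice stopped because it ran out of output tokens.
function wasTruncated(completion) {
  return completion.choices[0].finish_reason === "length";
}

// Mock responses with the same shape as the examples on this page.
const cutOff = { choices: [{ finish_reason: "length", message: { content: "The capital of" } }] };
const complete = { choices: [{ finish_reason: "stop", message: { content: "Paris." } }] };

console.log(wasTruncated(cutOff)); // true
console.log(wasTruncated(complete)); // false
```

When this returns `true`, you can retry with a larger `max_tokens` or ask the model to continue.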

## Next Steps

* Learn about [Pricing](/03-concepts/02-pricing.md)
* Understand [Model Profiles](/03-concepts/01-profiles.md)
* Optimize with [Best Practices](/05-advanced/01-best-practices.md)
* Start building with our [Quick Start Guide](/01-getting-started/01-quickstart.md)

## Support

Questions about tokens?

* **Email**: <support@halfred.ai>
* **Discord**: [Join our community](https://discord.gg/wS2awX4EV7)
* **Dashboard**: Monitor usage at [halfred.ai](https://halfred.ai)


