
Token Optimization

The AI Gateway provides tools to understand, analyze, and optimize your AI token usage. Reduce costs while maintaining quality through data-driven decisions.

Understanding Your Usage

Token Analytics Dashboard

See exactly where your tokens are going:
  • By model - Compare costs across GPT-5.2, Claude, etc.
  • By feature - Which parts of your app use the most tokens
  • By user - Identify heavy users and usage patterns
  • Over time - Track trends and spot anomalies

Cost Breakdown

| Metric | Description |
| --- | --- |
| Input Tokens | Tokens in the prompt you send |
| Output Tokens | Tokens in the AI response |
| Total Cost | Combined cost (output tokens typically cost more) |
| Requests | Number of API calls |
| Avg Tokens/Request | Efficiency metric |
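The Avg Tokens/Request metric can also be derived from raw usage records; a minimal sketch (the record shape here is an assumption, not the exact export schema):

```javascript
// Derive the dashboard metrics from raw usage records.
// The record fields are illustrative — check your actual export schema.
function summarizeUsage(records) {
  const inputTokens = records.reduce((sum, r) => sum + r.inputTokens, 0);
  const outputTokens = records.reduce((sum, r) => sum + r.outputTokens, 0);
  return {
    inputTokens,
    outputTokens,
    requests: records.length,
    avgTokensPerRequest: (inputTokens + outputTokens) / records.length,
  };
}

const stats = summarizeUsage([
  { inputTokens: 120, outputTokens: 80 },
  { inputTokens: 200, outputTokens: 100 },
]);
// stats.requests === 2, stats.avgTokensPerRequest === 250
```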

A/B Testing

Test different approaches to find the most cost-effective solution.

What to Test

Models

GPT-5.2 vs Claude
  • Quality vs cost tradeoffs
  • Task-specific performance

Prompts

Different system prompts
  • Shorter vs detailed instructions
  • Different tones/styles

Context Length

How much context to include
  • Minimal vs comprehensive
  • Impact on quality

Temperature

Model creativity settings
  • Lower for consistent outputs
  • Higher for variety

Setting Up an A/B Test

// Create an experiment
const experiment = await leanmcp.gateway.createExperiment({
  name: 'Model Comparison Q1 2024',
  variants: [
    { name: 'gpt-5.2', weight: 50, config: { model: 'gpt-5.2' } },
    { name: 'claude-3', weight: 50, config: { model: 'claude-sonnet-4-5-20250929' } },
  ],
  metrics: ['quality_rating', 'tokens_used', 'latency', 'cost'],
  duration: '14d',
});

Using Experiments in Code

// Get variant for user
const variant = await leanmcp.gateway.getVariant({
  experimentId: experiment.id,
  userId: userId,
});

// Use the assigned model
const response = await client.chat.completions.create({
  model: variant.config.model,
  messages: messages,
}, {
  headers: {
    'X-Experiment-ID': experiment.id,
    'X-Variant': variant.name,
  }
});

Tracking Outcomes

// Record quality metric (e.g., from user feedback)
await leanmcp.gateway.recordOutcome({
  experimentId: experiment.id,
  userId: userId,
  outcome: {
    quality_rating: 4.5, // 1-5 scale
    user_satisfied: true,
  }
});

Analyzing Results

The dashboard shows:
  • Statistical significance - Is the difference real?
  • Cost comparison - Savings per variant
  • Quality metrics - User satisfaction scores
  • Recommendation - Which variant to choose
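The dashboard computes significance for you. For intuition only, the check behind "is the difference real?" can be sketched as a two-proportion z-test on the `user_satisfied` rates (the counts below are made up):

```javascript
// Rough two-proportion z-test comparing "user_satisfied" rates
// between two variants. For intuition only — the dashboard runs
// the real statistical analysis.
function zScore(successA, totalA, successB, totalB) {
  const pA = successA / totalA;
  const pB = successB / totalB;
  const pPool = (successA + successB) / (totalA + totalB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / totalA + 1 / totalB));
  return (pA - pB) / se;
}

// |z| > 1.96 roughly corresponds to significance at the 95% level
const z = zScore(420, 500, 380, 500); // 84% vs 76% satisfied
```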

Competitor Insights

Learn from how others optimize.

Benchmarking

Compare your usage to industry averages:
  • Tokens per request - Are your prompts too long?
  • Model distribution - Are you using expensive models unnecessarily?
  • Error rates - Are you making inefficient retries?
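A simple way to act on the "tokens per request" benchmark is to flag features that sit well above the industry average; a sketch, where the threshold and benchmark value are arbitrary examples:

```javascript
// Flag features whose average prompt size is well above a benchmark.
// The 20% margin and the benchmark value are arbitrary examples.
function flagLongPrompts(features, benchmarkTokensPerRequest) {
  return features
    .filter(f => f.avgTokensPerRequest > benchmarkTokensPerRequest * 1.2)
    .map(f => f.name);
}

const flagged = flagLongPrompts(
  [
    { name: 'chat', avgTokensPerRequest: 900 },
    { name: 'summarize', avgTokensPerRequest: 400 },
  ],
  500 // hypothetical industry average
);
// flagged === ['chat']
```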

Learning from Patterns

Aggregated, anonymized insights from the platform help you understand best practices without exposing anyone’s specific implementation.
Common optimizations we’ve identified:
  • Roughly 60% of traffic sent to the largest models can be served by smaller models with minimal quality loss
  • Shorter system prompts often perform equally well
  • Caching common queries reduces costs by 30-40%

Optimization Strategies

1. Right-Size Your Models

Not every request needs GPT-5.2:
// Route based on complexity: send hard requests to the full model,
// everything else to a cheaper model of your choice
const model = estimateComplexity(message) > 0.7
  ? 'gpt-5.2'                     // full model for complex requests
  : 'claude-sonnet-4-5-20250929'; // substitute your cheaper model here

const response = await client.chat.completions.create({
  model: model,
  messages: messages,
});
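The routing code leaves `estimateComplexity` undefined. One naive heuristic (purely illustrative — production routers often use a small classifier model instead) scores by prompt length and keywords:

```javascript
// A naive complexity heuristic for routing — purely illustrative.
// Real routers often use a small classifier model instead.
function estimateComplexity(message) {
  const hardKeywords = ['prove', 'refactor', 'architecture', 'debug'];
  let score = Math.min(message.length / 2000, 0.5); // long prompts → harder
  if (hardKeywords.some(k => message.toLowerCase().includes(k))) {
    score += 0.5;
  }
  return Math.min(score, 1);
}
```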

2. Optimize Prompts

  • Trim context - Only include information the model actually needs. More context = more tokens.
  • Keep system prompts short - “You are a helpful coding assistant” works as well as a 500-word description for most tasks.
  • Cap output length - Use max_tokens to prevent unnecessarily long responses.
  • Constrain the format - “Reply in JSON format” or “Answer in one sentence” reduces output tokens.
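The context-trimming tip can be applied mechanically before each request; a minimal sketch (the window size is an arbitrary choice):

```javascript
// Trim conversation context before sending: keep the (short) system
// prompt and only the most recent turns. Window size is an example.
function trimContext(messages, maxTurns = 6) {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');
  return [...system, ...rest.slice(-maxTurns)];
}

const history = [
  { role: 'system', content: 'You are a helpful coding assistant' },
  ...Array.from({ length: 20 }, (_, i) => ({ role: 'user', content: `msg ${i}` })),
];
const trimmed = trimContext(history);
// 1 system message + the 6 most recent turns = 7 messages
```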

3. Implement Caching

Cache identical or similar requests:
// Enable response caching
const response = await client.chat.completions.create({
  model: 'gpt-5.2',
  messages: messages,
}, {
  headers: {
    'X-Enable-Cache': 'true',
    'X-Cache-TTL': '3600', // 1 hour
  }
});
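Gateway-side caching via `X-Enable-Cache` is handled for you. If you also want to deduplicate identical requests within a single process, an in-memory sketch (not part of the gateway API):

```javascript
// Minimal in-process cache keyed by the request payload, with a TTL.
// For shared, cross-instance caching use the gateway's
// X-Enable-Cache header instead.
const cache = new Map();

async function cachedCompletion(payload, fetchFn, ttlMs = 3600 * 1000) {
  const key = JSON.stringify(payload);
  const hit = cache.get(key);
  if (hit && Date.now() - hit.at < ttlMs) return hit.value; // fresh hit
  const value = await fetchFn(payload); // miss: make the real call
  cache.set(key, { value, at: Date.now() });
  return value;
}
```

Identical payloads within the TTL return the cached response and skip the API call entirely.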

4. Batch Similar Requests

Combine multiple small requests:
// Instead of 10 separate calls
// Batch into one request with multiple items
const response = await client.chat.completions.create({
  model: 'gpt-5.2',
  messages: [{
    role: 'user',
    content: `Analyze these 10 items:\n${items.join('\n')}`
  }],
});
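Batching also needs the reverse step: splitting the single reply back into per-item answers. A sketch, assuming you instruct the model to answer as a numbered list:

```javascript
// Build one batched prompt and split the numbered reply back into
// per-item answers. Assumes the model is told to reply as a
// numbered list matching the input order.
function buildBatchPrompt(items) {
  return `Analyze these ${items.length} items. Reply as a numbered list:\n` +
    items.map((item, i) => `${i + 1}. ${item}`).join('\n');
}

function splitNumberedReply(text) {
  return text
    .split(/\n(?=\d+\.\s)/)           // break before each "N. " line
    .map(line => line.replace(/^\d+\.\s*/, '').trim());
}

const answers = splitNumberedReply('1. positive\n2. negative\n3. neutral');
// answers === ['positive', 'negative', 'neutral']
```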

Cost Alerts and Limits

Budget Controls

// Set spending limits
await leanmcp.gateway.setBudget({
  daily: 50.00,
  weekly: 200.00,
  monthly: 500.00,
  action: 'alert', // or 'block' to hard stop
});

Alert Configuration

await leanmcp.gateway.createCostAlert({
  threshold: 100.00, // dollars
  period: 'daily',
  channels: ['email', 'slack'],
});

Reporting

Usage Reports

Generate detailed reports:
const report = await leanmcp.gateway.generateReport({
  type: 'usage',
  period: 'monthly',
  groupBy: ['model', 'feature', 'user'],
  format: 'pdf',
});

Export for Analysis

# Export token usage data
curl -X GET "https://api.leanmcp.com/gateway/usage?period=monthly" \
  -H "Authorization: Bearer your-api-key" \
  -o usage-report.json
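Once exported, the JSON can be analyzed however you like, for example totaling cost per model. The record fields below are assumptions — inspect `usage-report.json` for the real schema:

```javascript
// Summarize an exported usage report by model.
// The record fields are assumptions — check the actual export schema.
function costByModel(records) {
  const totals = {};
  for (const r of records) {
    totals[r.model] = (totals[r.model] || 0) + r.cost;
  }
  return totals;
}

const totals = costByModel([
  { model: 'gpt-5.2', cost: 12.5 },
  { model: 'claude-sonnet-4-5-20250929', cost: 4.0 },
  { model: 'gpt-5.2', cost: 7.5 },
]);
// totals['gpt-5.2'] === 20
```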

ROI Calculator

Understand the value of optimization:
| Scenario | Current Cost | Optimized Cost | Savings |
| --- | --- | --- | --- |
| Model right-sizing | $1,000/mo | $600/mo | 40% |
| Prompt optimization | $600/mo | $450/mo | 25% |
| Caching | $450/mo | $350/mo | 22% |
| Total | $1,000/mo | $350/mo | 65% |
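Note that the savings compound: each optimization applies to the cost remaining after the previous one, which is why the row percentages do not sum to the total. The arithmetic:

```javascript
// Each optimization applies to the cost remaining after the previous
// one, so the per-row percentages compound into the 65% total.
const steps = [
  { name: 'Model right-sizing', after: 600 },
  { name: 'Prompt optimization', after: 450 },
  { name: 'Caching', after: 350 },
];

let cost = 1000; // current monthly cost
for (const step of steps) {
  const savings = Math.round((1 - step.after / cost) * 100);
  console.log(`${step.name}: $${cost}/mo -> $${step.after}/mo (${savings}%)`);
  cost = step.after;
}

const totalSavings = Math.round((1 - 350 / 1000) * 100); // 65
```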

Best Practices

  • Measure first - You can’t optimize what you don’t measure. Set up tracking before making changes.
  • Test one change at a time - Run isolated A/B tests to understand the impact of each change.
  • Balance cost and quality - The cheapest option isn’t always the best. Track quality metrics alongside cost.
  • Review regularly - Usage patterns change. Schedule monthly reviews of your optimization strategies.
  • Set budget alerts - They prevent surprise bills and catch issues quickly.

Next Steps