Model comparison guide

Last updated: March 2026

This guide is updated within days of major model releases. The version in the book was accurate at time of publication; this page always reflects the current landscape. Every stat links to its source.

Claude Opus 4.6

by Anthropic · Released February 2026

200K tokens context

Complex reasoning, extended analysis, long documents, agentic coding, research

Pricing

$15 input / $75 output per million tokens. Included in Claude Pro ($20/mo).

Context window

200K tokens

Strengths

  • Frontier reasoning capability
  • METR 50% time horizon: ~14.5 hours
  • Excellent long-form writing
  • Agentic tool use and coding

Limitations

  • Slower than smaller models
  • Higher cost per token
  • Can be verbose

Claude Sonnet 4.6

by Anthropic · Released February 2026

200K tokens context

Everyday tasks, coding, analysis, balance of speed and capability

Pricing

$3 input / $15 output per million tokens. Included in Claude Pro ($20/mo).

Context window

200K tokens

Strengths

  • Fast and capable
  • Excellent at coding
  • Strong instruction following
  • Best value for daily use

Limitations

  • Less depth than Opus on complex tasks
  • Shorter generation limits

GPT-5.4

by OpenAI · Released March 2026

200K tokens context

General purpose, multimodal tasks, reasoning, creative work

Pricing

Included in ChatGPT Plus ($20/mo) and ChatGPT Pro ($200/mo).

Context window

200K tokens

Strengths

  • Strong multimodal capability
  • Advanced reasoning
  • Wide ecosystem and integrations
  • Image generation built in

Limitations

  • Can hallucinate confidently
  • Expensive at API scale
  • Pro tier needed for full capability

GPT-o3

by OpenAI · Released mid-2025

200K tokens context

Advanced reasoning, mathematics, research, complex problem solving

Pricing

$10 input / $40 output per million tokens. Available on ChatGPT Pro ($200/mo).

Context window

200K tokens

Strengths

  • Exceptional reasoning
  • Strong at maths and science
  • Chain-of-thought built in
  • METR 50% time horizon: ~75–90 min

Limitations

  • Slow (thinks before responding)
  • Expensive
  • Can overthink simple tasks

Gemini 3.1 Pro

by Google · Released February 2026

1M tokens context

Multimodal analysis, real-time information, code generation, long context

Pricing

Included in Gemini Advanced ($20/mo). Competitive API pricing.

Context window

1M tokens

Strengths

  • Largest context window (1M tokens)
  • ARC-AGI-2 score: 77.1%
  • Real-time web access
  • Strong multimodal

Limitations

  • Newer ecosystem than OpenAI/Anthropic
  • Output quality can vary
  • Less established for long-form writing

Gemini 2.5 Flash

by Google · Released March 2025

1M tokens context

Fast tasks, summarisation, classification, high-volume processing

Pricing

$0.15 input / $0.60 output per million tokens. Free tier available.

Context window

1M tokens

Strengths

  • Very fast
  • Extremely cheap
  • Large context window
  • Good for batch processing

Limitations

  • Less capable on complex tasks
  • Weaker reasoning
  • Less nuanced writing

Quick comparison

Model               Provider    Context   Speed       Cost
Claude Opus 4.6     Anthropic   200K      Moderate    $$$
Claude Sonnet 4.6   Anthropic   200K      Fast        $$
GPT-5.4             OpenAI      200K      Fast        $$
GPT-o3              OpenAI      200K      Slow        $$$$
Gemini 3.1 Pro      Google      1M        Fast        $$
Gemini 2.5 Flash    Google      1M        Very fast   $
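The per-token prices quoted above translate directly into per-call costs. A minimal sketch in Python (the workload numbers are illustrative, and only the models whose API prices appear in this guide are included):

```python
# Published API prices (USD per million tokens), as listed in the cards above.
# GPT-5.4 and Gemini 3.1 Pro are omitted: their per-token API prices
# aren't quoted in this guide.
PRICES = {
    "Claude Opus 4.6":   (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-o3":            (10.00, 40.00),
    "Gemini 2.5 Flash":  (0.15, 0.60),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of a single call from per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 5,000 input tokens in, 1,000 output tokens back.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 5_000, 1_000):.5f} per call")
```

At that workload, the spread is roughly a hundredfold between Opus and Flash, which is why the batch-processing advice below points at the cheapest tier.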

How to choose

For complex reasoning, extended research, or documents longer than 10,000 words: Claude Opus 4.6 or GPT-o3. These models reason more deeply and handle nuance better, but they cost more and respond more slowly.

For everyday work, coding, email drafting, and analysis: Claude Sonnet 4.6 or GPT-5.4. These are the workhorses. Fast, capable, affordable. Start here for most tasks.

For tasks involving images, audio, or video: GPT-5.4 or Gemini 3.1 Pro. Both handle multimodal input well. Gemini has the edge on real-time web access and the largest context window.

For high-volume processing or cost-sensitive tasks: Gemini 2.5 Flash. By far the cheapest, with a massive 1M token context window. Ideal for summarisation, classification, and batch work.

Always test on your specific task. Model performance varies dramatically across different types of work. What works best for coding might not be optimal for writing or analysis.
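The context windows above are measured in tokens, not words. A quick way to check whether a document fits is the common rule of thumb of roughly 4 characters per token; this heuristic is an assumption, and the real tokenizers used by each provider will give somewhat different counts:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token.
    Provider tokenizers differ; treat this as a first-pass check,
    not an exact count."""
    return max(1, len(text) // 4)

def fits_context(text: str, window_tokens: int, reserve: int = 4_000) -> bool:
    """Leave headroom (`reserve`) for instructions and the model's reply."""
    return estimate_tokens(text) + reserve <= window_tokens

doc = "word " * 10_000          # a ~10,000-word document, as in the guide above
print(estimate_tokens(doc))     # rough token count
print(fits_context(doc, 200_000))    # fits a 200K window?
print(fits_context(doc, 1_000_000))  # fits a 1M window?
```

By this estimate a 10,000-word document uses only a small fraction of a 200K window; the 1M windows on the Gemini models matter when you are loading whole codebases or book-length material.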

