Model comparison guide
There is no best model. There is only the best model for this task, at this cost, right now. This guide pairs with chapter two of Artificial Leverage: a plain-language map of the leading systems and how to weigh them. The landscape moves every quarter, so treat the specifics as a snapshot and the framework as the durable part.
Five ways to judge any model
Reasoning depth
How well it holds a complex chain of thought without losing the thread.
Instruction following
How closely it sticks to what you actually asked for.
Context capacity
How much you can load into one conversation before it forgets.
Cost efficiency
What the output costs at the volume you actually use.
Domain speciality
Where it is genuinely stronger: code, analysis, research, or writing.
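One way to make "the best model for this task" concrete is to score each candidate on the five criteria and weight those scores by what the task needs. The sketch below is illustrative only: the 1 to 5 scale, the candidate names `model_a` and `model_b`, their scores, and both weight profiles are hypothetical placeholders, not measurements of any real system.

```python
# The five criteria from this guide, as short keys.
CRITERIA = ("reasoning", "instructions", "context", "cost", "domain")


def weighted_score(scores: dict, weights: dict) -> float:
    """Return the weighted average of a model's criterion scores."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def best_model(candidates: dict, weights: dict) -> str:
    """Return the name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


# Hypothetical candidates scored 1 (weak) to 5 (strong). Placeholder values only.
candidates = {
    "model_a": {"reasoning": 5, "instructions": 4, "context": 4, "cost": 2, "domain": 4},
    "model_b": {"reasoning": 3, "instructions": 4, "context": 3, "cost": 5, "domain": 3},
}

# Hypothetical weight profiles for two different kinds of task.
research_weights = {"reasoning": 5, "instructions": 2, "context": 3, "cost": 1, "domain": 2}
bulk_weights = {"reasoning": 1, "instructions": 3, "context": 1, "cost": 5, "domain": 1}

if __name__ == "__main__":
    print("Deep research task:", best_model(candidates, research_weights))
    print("High-volume task:", best_model(candidates, bulk_weights))
```

With these placeholder numbers the pick flips between the two tasks, which is the point of the framework: the same candidates rank differently once the weights change.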
The current flagships
| Model | Maker | Context | Cost (in / out, USD per 1M) | Open weights |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 1M tokens | $5 / $25 | No |
| GPT-5.2 | OpenAI | 200K tokens | Varies | No |
| Gemini 3.1 Pro | Google | 1M tokens | $2 / $12 | No |
| DeepSeek V3.2 | DeepSeek | 164K tokens | $0.26 / $0.38 | Yes |
| Llama 4 Maverick | Meta | 1M tokens | Varies | Yes |
Specifications come from the site's capability tracker (sources: METR, official model cards, and Artificial Analysis). Costs and context windows change often; check the live tools below for the current numbers.
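To see what "cost at the volume you actually use" means in practice, the per-million prices in the table can be turned into a monthly estimate. The sketch below uses only the rows with a listed price; the workload of 50M input tokens and 10M output tokens per month is an assumed example, and the prices are this snapshot's figures, so substitute current ones before relying on the result.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
# Rows listed as "Varies" are omitted because they have no single list price.
PRICES_PER_MILLION = {
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "DeepSeek V3.2": (0.26, 0.38),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one month's token volume for a model."""
    price_in, price_out = PRICES_PER_MILLION[model]
    return (input_tokens / 1_000_000) * price_in + (output_tokens / 1_000_000) * price_out


# Assumed workload for illustration: 50M input and 10M output tokens per month.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 10_000_000

if __name__ == "__main__":
    for model in PRICES_PER_MILLION:
        cost = monthly_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
        print(f"{model}: ${cost:,.2f} per month")
```

At this assumed volume the estimates come to roughly $500, $220, and $17 per month respectively. Output-heavy workloads shift the comparison, because output tokens are priced higher than input tokens on every listed row.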
What each one is best for
Go deeper with the live tools
This guide is a starting point. For an interactive walk-through of which model fits a specific task, or for live benchmark data that updates as new models ship, use the tools.