Resource

Model comparison guide

There is no best model. There is only the best model for this task, at this cost, right now. This guide pairs with chapter two of Artificial Leverage: a plain-language map of the leading systems and how to weigh them. The landscape moves every quarter, so treat the specifics as a snapshot and the framework as the durable part.

Five ways to judge any model

Reasoning depth

How well it holds a complex chain of thought without losing the thread.

Instruction following

How closely it sticks to what you actually asked for.

Context capacity

How much you can load into one conversation before it forgets.

Cost efficiency

What the output costs at the volume you actually use.

Domain speciality

Where it is genuinely stronger: code, analysis, research, or writing.
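One way to apply the five criteria is to weight each by how much it matters for the task and compare candidates on a single number. The sketch below is only an illustration: the function name, the 0 to 10 scale, the weights, and the two unnamed candidates' scores are all hypothetical placeholders, not measurements of any real model.

```python
# The five criteria from this guide, in short form.
CRITERIA = ("reasoning", "instructions", "context", "cost", "domain")


def weighted_score(scores, weights):
    """Return the weighted average of 0-10 scores across the five criteria."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


# Hypothetical weights for a high-volume summarisation task where cost dominates.
weights = {"reasoning": 1, "instructions": 2, "context": 2, "cost": 4, "domain": 1}

# Placeholder scores for two unnamed candidates (assumed values, not benchmarks).
candidates = {
    "candidate_a": {"reasoning": 9, "instructions": 8, "context": 9, "cost": 3, "domain": 8},
    "candidate_b": {"reasoning": 6, "instructions": 7, "context": 6, "cost": 9, "domain": 6},
}

# Rank candidates from best to worst fit for this particular task.
ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)

for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], weights):.1f}")
```

With these assumed numbers the cheaper candidate ranks first even though it scores lower on reasoning. Change the weights to favour reasoning and the order flips, which is the point: the ranking depends on the task.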

The current flagships

Model | Maker | Context | Cost (in / out, USD per 1M tokens) | Open weights
Claude Opus 4.6 | Anthropic | 1M tokens | $5 / $25 | No
GPT-5.2 | OpenAI | 200K tokens | Varies | No
Gemini 3.1 Pro | Google | 1M tokens | $2 / $12 | No
DeepSeek V3.2 | DeepSeek | 164K tokens | $0.26 / $0.38 | Yes
Llama 4 Maverick | Meta | 1M tokens | Varies | Yes

Specifications from the site capability tracker (METR, official model cards, and Artificial Analysis). Costs and context windows change often; check the live tools below for the current numbers.
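To make the cost-efficiency criterion concrete, the per-million-token prices in the table above can be turned into a monthly estimate. This is a rough sketch under stated assumptions: the function name `monthly_cost` and the workload (50M input and 10M output tokens a month) are hypothetical, and the prices are this guide's snapshot, not live rates. Models listed as "Varies" are left out because the guide gives no figure for them.

```python
# Prices copied from the table above: (input, output) in USD per 1M tokens.
# Snapshot values only; models priced as "Varies" are omitted.
PRICES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "DeepSeek V3.2": (0.26, 0.38),
}


def monthly_cost(model, input_tokens, output_tokens):
    """Estimate spend in USD for a given token volume at the listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 50M input tokens and 10M output tokens per month.
for model in PRICES:
    cost = monthly_cost(model, 50_000_000, 10_000_000)
    print(f"{model}: ${cost:,.2f}")
```

At this assumed volume the listed prices work out to roughly $500, $220, and $16.80 a month respectively. The gap widens or narrows with your own input-to-output ratio, so run it with real usage numbers.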

What each one is best for

Claude Opus 4.6: Long-horizon agentic work, deep analysis, and writing. The longest measured autonomous task horizon.
GPT-5.2: Frontier reasoning and software engineering, with strong tool use.
Gemini 3.1 Pro: Research synthesis and multimodal work across a very large context window.
DeepSeek V3.2: Cost-sensitive workloads and self-hosting. Open weights at a fraction of frontier pricing.
Llama 4 Maverick: On-premise and private deployments where open weights and control matter most.

Go deeper with the live tools

This guide is a starting point. For an interactive walk-through of which model fits a specific task, or for live benchmark data that updates as new models ship, use the tools.