
Frontier AI Model Report

February 2026

A comprehensive benchmark analysis for executives, founders and operators

Claude Opus 4.6
Claude Sonnet 4.6
Gemini 3.1 Pro
GPT-5.3 Codex
Grok 4.1
DeepSeek V3.2

What This Report Covers

Side-by-side benchmark results for six frontier AI models across reasoning, coding, agentic tasks, multimodal understanding, long-context retrieval, price, and speed. Sourced from official model cards and third-party evaluations as of February 27, 2026.

Bottom Line

No single model wins every category. Gemini 3.1 Pro leads reasoning. GPT-5.3 Codex leads autonomous coding. Claude Opus 4.6 leads agentic and computer-use tasks. Grok 4.1 leads speed and context. DeepSeek V3.2 leads cost and open-source access.

Quick Answers

Best for reasoning

Gemini 3.1 Pro

94.3% GPQA Diamond

Best for coding

GPT-5.3 Codex

77.3% Terminal-Bench 2.0

Best for agentic tasks

Claude Opus 4.6

53.0% HLE with Tools

Best value

DeepSeek V3.2

$0.28/M input tokens

Section 01

Which AI Model Has the Best Reasoning?

Gemini 3.1 Pro leads PhD-level and abstract reasoning in February 2026, scoring highest on GPQA Diamond (94.3%), ARC-AGI-2 (77.1%), and Humanity's Last Exam without tools (44.4%). Claude Opus 4.6 leads when tools are available, scoring 53.0% on HLE with Tools.

GPQA Diamond (PhD-level science)

Claude Opus 4.6: 91.3%
Claude Sonnet 4.6: 74.1%
Gemini 3.1 Pro: 94.3% ★
GPT-5.3 Codex: 92.4%
Grok 4.1: 87.5%
DeepSeek V3.2: 79.9%

ARC-AGI-2 (novel abstract reasoning)

Claude Opus 4.6: 68.8%
Claude Sonnet 4.6: 60.4%
Gemini 3.1 Pro: 77.1% ★
GPT-5.3 Codex: 52.9%
Grok 4.1: N/R
DeepSeek V3.2: N/R

Humanity's Last Exam (no tools)

Claude Opus 4.6: 40%
Claude Sonnet 4.6: 19.1%
Gemini 3.1 Pro: 44.4% ★
GPT-5.3 Codex: 34.5%
Grok 4.1: 25.4%
DeepSeek V3.2: N/A

HLE with Tools

Claude Opus 4.6: 53.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 51.4%
GPT-5.3 Codex: 45.5%
Grok 4.1: N/A
DeepSeek V3.2: N/A

AIME 2025 (competition math)

Claude Opus 4.6: ~82%
Claude Sonnet 4.6: ~70%
Gemini 3.1 Pro: 95%
GPT-5.3 Codex: 100% ★
Grok 4.1: ~95%
DeepSeek V3.2: 96%

MMMLU (multilingual knowledge)

Claude Opus 4.6: 91.1%
Claude Sonnet 4.6: 79.1%
Gemini 3.1 Pro: 92.6% ★
GPT-5.3 Codex: 89.6%
Grok 4.1: N/A
DeepSeek V3.2: N/A

Section 02

Which AI Model is Best for Coding and Software Engineering?

GPT-5.3 Codex leads autonomous coding with 77.3% on Terminal-Bench 2.0, 56.8% on SWE-bench Pro, and a perfect 100% on AIME 2025. Claude Opus 4.6 leads computer-use with 72.7% on OSWorld-Verified. Gemini 3.1 Pro leads competitive programming at 75.6% on LiveCodeBench.

SWE-bench Verified (real GitHub bugs)

Claude Opus 4.6: 80.8% ★
Claude Sonnet 4.6: 79.6%
Gemini 3.1 Pro: 80.6%
GPT-5.3 Codex: ~80%
Grok 4.1: ~75%
DeepSeek V3.2: 67.8%

SWE-bench Pro (harder tasks)

Claude Opus 4.6: N/A
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 54.2%
GPT-5.3 Codex: 56.8% ★
Grok 4.1: N/A
DeepSeek V3.2: N/A

Terminal-Bench 2.0 (agentic terminal)

Claude Opus 4.6: 65.4%
Claude Sonnet 4.6: 59.1%
Gemini 3.1 Pro: 68.5%
GPT-5.3 Codex: 77.3% ★
Grok 4.1: N/R
DeepSeek V3.2: 31.3%

OSWorld-Verified (computer-use agent)

Claude Opus 4.6: 72.7% ★
Claude Sonnet 4.6: 72.5%
Gemini 3.1 Pro: N/R
GPT-5.3 Codex: 64.7%
Grok 4.1: N/R
DeepSeek V3.2: N/R

LiveCodeBench (competitive programming)

Claude Opus 4.6: ~62%
Claude Sonnet 4.6: ~58%
Gemini 3.1 Pro: 75.6% ★
GPT-5.3 Codex: ~65%
Grok 4.1: ~63%
DeepSeek V3.2: 74.1%

Section 03

Which AI Model is Best for Agentic Tasks and Tool Use?

Claude Sonnet 4.6 leads office productivity with a GDPval-AA Elo of 1,633. Claude Opus 4.6 leads customer service agent tasks (93.5%) and finance agent benchmarks (60.7%). Gemini 3.1 Pro leads general web research at 85.9% on BrowseComp.


tau2-bench Retail (customer service agent)

Claude Opus 4.6: 93.5% ★
Claude Sonnet 4.6: 91.7%
Gemini 3.1 Pro: 90.8%
GPT-5.3 Codex: 82%
Grok 4.1: N/A
DeepSeek V3.2: N/A

GDPval-AA Elo (office productivity)

Claude Opus 4.6: 1,606 Elo
Claude Sonnet 4.6: 1,633 Elo ★
Gemini 3.1 Pro: 1,317 Elo
GPT-5.3 Codex: 1,462 Elo
Grok 4.1: N/A
DeepSeek V3.2: N/A

BrowseComp (web research agent)

Claude Opus 4.6: 84%
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 85.9% ★
GPT-5.3 Codex: 65.8%
Grok 4.1: N/A
DeepSeek V3.2: N/A

Finance Agent

Claude Opus 4.6: 60.7%
Claude Sonnet 4.6: 63.3% ★
Gemini 3.1 Pro: N/A
GPT-5.3 Codex: N/A
Grok 4.1: N/A
DeepSeek V3.2: N/A

Section 04

Which AI Model Has the Best Multimodal and Long Context Support?

Gemini 3.1 Pro leads vision and reasoning at 80.5% on MMMU-Pro. Claude Opus 4.6 and Gemini 3.1 Pro tie for 128K long-context retrieval at 84.9%. Claude Opus 4.6 leads extreme 1-million-token context at 76.0%. DeepSeek V3.2 is text-only with no vision support.

MMMU-Pro (vision + reasoning)

Claude Opus 4.6: 73.9%
Claude Sonnet 4.6: ~68%
Gemini 3.1 Pro: 80.5% ★
GPT-5.3 Codex: 79.5%
Grok 4.1: N/A
DeepSeek V3.2: text-only

MRCR v2 128K (long-context retrieval)

Claude Opus 4.6: 84.9% ★
Claude Sonnet 4.6: ~81%
Gemini 3.1 Pro: 84.9% ★
GPT-5.3 Codex: 83.8%
Grok 4.1: N/A
DeepSeek V3.2: N/A

MRCR v2 1M tokens (extreme long context)

Claude Opus 4.6: 76.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 26.3†
GPT-5.3 Codex: N/A
Grok 4.1: N/A
DeepSeek V3.2: N/A

Section 05

Context Windows, API Access, and Release Dates

Full specifications for all six frontier models as of February 2026, including pricing, output speed, context window size, and the Artificial Analysis Intelligence Index composite score.

Claude Opus 4.6: released Feb 5 '26; 200K context (1M beta); $5.00/M input, $25.00/M output; 67-72 tokens/sec; Arena Elo ~1,506 (#1); AA Intelligence Index 53; closed weights.
Claude Sonnet 4.6: released Feb 17 '26; 200K context (1M beta); $3.00/M input, $15.00/M output; 54-56 tokens/sec; Arena Elo top 5; AA Intelligence Index 52; closed weights.
Gemini 3.1 Pro: released Feb 19 '26; 1M context standard; $2.00/M input, $12.00/M output; 91-110 tokens/sec; Arena Elo ~1,492; AA Intelligence Index 57; closed weights.
GPT-5.3 Codex: released Feb 5 '26; ~200K context; ~$1.75/M input, ~$14.00/M output; ~89 tokens/sec; Arena Elo still in testing; AA Intelligence Index not reported; closed weights.
Grok 4.1: released Nov 17 '25; 2M context; $0.20/M input, $0.50/M output; 118 tokens/sec; Arena Elo ~1,483; AA Intelligence Index 35†; closed weights.
DeepSeek V3.2: released Dec 1 '25; 128K context; $0.28/M input, $0.42/M output; 49 tokens/sec; Arena Elo not reported; AA Intelligence Index 32; open weights (MIT).
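As a rough illustration of what these context-window sizes mean in practice, the sketch below uses the common ~4 characters per token heuristic to test whether a document of a given size fits each window. The character count and the chars-per-token ratio are illustrative assumptions (real tokenizer ratios vary by model and language), not figures from this report.

```python
# Context windows from the specifications above (tokens).
CONTEXT_WINDOWS = {
    "Claude Opus 4.6": 200_000,   # 1M available in beta
    "Gemini 3.1 Pro": 1_000_000,
    "Grok 4.1": 2_000_000,
    "DeepSeek V3.2": 128_000,
}

def fits(char_count, window_tokens, chars_per_token=4):
    """True if the estimated token count fits the window.

    Uses the rough ~4 chars/token heuristic; an approximation only.
    """
    return char_count / chars_per_token <= window_tokens

# Hypothetical workload: a ~3 MB text dump (~750K estimated tokens).
doc_chars = 3_000_000
for model, window in CONTEXT_WINDOWS.items():
    verdict = "fits" if fits(doc_chars, window) else "does not fit"
    print(f"{model}: {verdict}")
```

Under this heuristic the 3 MB example fits only the 1M and 2M windows; anything near a window's limit should be checked with the provider's own token counter before relying on it.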

Section 06

Which Frontier AI Model Offers the Best Value?

Output cost and inference speed determine which model is practical at scale. Grok 4.1 leads speed at 118 tokens per second. DeepSeek V3.2 leads cost at $0.42 per million output tokens.

Output Price: $/1M Tokens

Claude Opus 4.6: $25.00
Claude Sonnet 4.6: $15.00
Gemini 3.1 Pro: $12.00
GPT-5.3 Codex: $14.00
Grok 4.1: $0.50
DeepSeek V3.2: $0.42

Speed: Tokens per Second

Claude Opus 4.6: 67-72 t/s
Claude Sonnet 4.6: 54-56 t/s
Gemini 3.1 Pro: 91-110 t/s
GPT-5.3 Codex: ~89 t/s
Grok 4.1: 118 t/s
DeepSeek V3.2: 49 t/s
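To see how output price and speed combine at scale, here is a minimal back-of-the-envelope estimator using the figures above. The workload (10,000 responses of 500 tokens each) is a hypothetical example, and the single-stream timing deliberately ignores batching and parallel requests, which providers support in practice.

```python
# Output price ($ per 1M tokens) and speed (tokens/sec) from the lists above.
# Ranges are collapsed to midpoints for simplicity.
MODELS = {
    "Claude Opus 4.6": (25.00, 70),   # midpoint of 67-72 t/s
    "Gemini 3.1 Pro":  (12.00, 100),  # midpoint of 91-110 t/s
    "Grok 4.1":        (0.50, 118),
    "DeepSeek V3.2":   (0.42, 49),
}

def batch_estimate(name, n_requests, tokens_per_response):
    """Return (cost in USD, single-stream generation time in hours)."""
    price_per_m, tps = MODELS[name]
    total_tokens = n_requests * tokens_per_response
    cost = total_tokens / 1_000_000 * price_per_m
    hours = total_tokens / tps / 3600  # one request at a time, no batching
    return cost, hours

for name in MODELS:
    cost, hours = batch_estimate(name, n_requests=10_000, tokens_per_response=500)
    print(f"{name}: ${cost:,.2f}, {hours:.1f} h single-stream")
```

On this hypothetical 5M-token workload the spread is stark: roughly $125 on Claude Opus 4.6 versus about $2 on DeepSeek V3.2, which is the gap the value discussion above is pointing at.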

Section 07

Where Each Model Wins

Claude Opus 4.6

Best for complex agentic work

  • OSWorld computer-use leader (72.7%)
  • HLE with Tools leader (53.0%)
  • tau2-bench Retail (93.5%)
  • Arena Elo #1 (~1,506)
  • 1M-token long context (76.0%)
  • Human preference overall

Claude Sonnet 4.6

Best production value

  • Office work Elo #1 (1,633)
  • Finance agent leader (63.3%)
  • Near-Opus quality at 60% cost
  • 79.6% SWE-bench Verified

Gemini 3.1 Pro

Best reasoning + price-performance

  • ARC-AGI-2 leader (77.1%)
  • GPQA Diamond leader (94.3%)
  • HLE no-tools leader (44.4%)
  • AA Intelligence Index #1 (57)
  • Fastest major model (91-110 t/s)
  • 1M context standard

GPT-5.3 Codex

Best autonomous coding

  • Terminal-Bench 2.0 leader (77.3%)
  • SWE-bench Pro leader (56.8%)
  • AIME 2025 perfect score (100%)
  • FrontierMath Tiers 1-3 (40.3%)

Grok 4.1

Best speed + value

  • Fastest output (118 t/s)
  • Largest context window (2M tokens)
  • EQ-Bench #1 (1,586)
  • Built-in real-time X/web search
  • $0.50/M output: 50x cheaper than Opus

DeepSeek V3.2

Best open-source

  • Only fully open-weight model (MIT)
  • AIME 2025 (96.0%)
  • LiveCodeBench (74.1%)
  • $0.42/M output: cheapest API
  • Self-hostable on your own infrastructure

Original Analysis

Price-Performance Index

Attainment's Price-Performance Index divides each model's AA Intelligence Index score by its output cost per million tokens. Higher is better. GPT-5.3 Codex is excluded: no AA Intelligence Index score is published as of this report.

DeepSeek V3.2: AA Index 32; $0.42/M output; 76.2 pts/$
Grok 4.1: AA Index 35; $0.50/M output; 70.0 pts/$
Gemini 3.1 Pro: AA Index 57; $12.00/M output; 4.8 pts/$
Claude Sonnet 4.6: AA Index 52; $15.00/M output; 3.5 pts/$
Claude Opus 4.6: AA Index 53; $25.00/M output; 2.1 pts/$
GPT-5.3 Codex: AA Index N/A; $14.00/M output; N/A

Price-Performance = AA Intelligence Index divided by output cost per million tokens. Higher score means more intelligence per dollar. Calculated by Attainment, February 2026.
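The index is simple enough to recompute directly from the published figures; a minimal sketch:

```python
# Price-Performance Index: AA Intelligence Index divided by output cost
# per 1M tokens, rounded to one decimal. Figures from the table above.
models = {
    "DeepSeek V3.2":     (32, 0.42),
    "Grok 4.1":          (35, 0.50),
    "Gemini 3.1 Pro":    (57, 12.00),
    "Claude Sonnet 4.6": (52, 15.00),
    "Claude Opus 4.6":   (53, 25.00),
}

def price_performance(aa_index, output_price):
    """Intelligence points per output dollar (higher is better)."""
    return round(aa_index / output_price, 1)

for name, (aa, price) in models.items():
    print(f"{name}: {price_performance(aa, price)} pts/$")
```

Running this reproduces the table: the two cheap models land near two orders of magnitude above the premium tier on this metric, which is why the index rewards low output pricing so heavily.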

Methodology

How We Scored These Models

Benchmark scores were collected from official model cards, provider documentation, and independent evaluation platforms. All data reflects published results as of February 27, 2026. We selected benchmarks that measure distinct capabilities: PhD-level science knowledge (GPQA Diamond), abstract reasoning (ARC-AGI-2), real-world software engineering (SWE-bench), autonomous computing (OSWorld), and vision reasoning (MMMU-Pro). The AA Intelligence Index from Artificial Analysis provides a composite score normalized across evaluations.


Summary

Key Takeaways

  1. Gemini 3.1 Pro leads raw reasoning: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, AA Intelligence Index 57. Best reasoning-per-dollar of any frontier model in this report.

  2. Claude Opus 4.6 ranks highest by human preference (Arena Elo ~1,506) and leads agentic benchmarks: OSWorld 72.7%, HLE with Tools 53.0%, tau2-bench Retail 93.5%.

  3. GPT-5.3 Codex leads autonomous coding: Terminal-Bench 2.0 at 77.3%, SWE-bench Pro at 56.8%, and a perfect 100% on AIME 2025 math.

  4. Claude Sonnet 4.6 is the best production value: it leads office workflow Elo (1,633) and finance agent tasks (63.3%) at 60% of Claude Opus pricing.

  5. Grok 4.1 is the fastest model (118 tokens per second), with the largest context window (2M tokens) and output pricing of $0.50/M: 50x cheaper than Claude Opus 4.6.

  6. DeepSeek V3.2 is the only MIT-licensed open-weight model, with competitive coding scores (LiveCodeBench 74.1%) and the cheapest API at $0.42/M output.

FAQ

Frequently Asked Questions

Which AI model is best for reasoning in 2026?
Gemini 3.1 Pro leads reasoning benchmarks in February 2026, scoring 94.3% on GPQA Diamond, 77.1% on ARC-AGI-2, and 44.4% on Humanity's Last Exam without tools. Claude Opus 4.6 leads agentic reasoning tasks requiring tools, scoring 53.0% on HLE with Tools.
Which AI model is best for coding in 2026?
GPT-5.3 Codex leads autonomous coding benchmarks in February 2026, scoring 77.3% on Terminal-Bench 2.0, 56.8% on SWE-bench Pro, and a perfect 100% on AIME 2025. Claude Opus 4.6 leads computer-use tasks with 72.7% on OSWorld-Verified.
Which frontier AI model has the largest context window?
Grok 4.1 has the largest standard context window at 2 million tokens as of February 2026. Gemini 3.1 Pro offers 1 million tokens as standard. Claude Opus 4.6 and Sonnet 4.6 support 200K tokens with 1 million in beta.
What is the cheapest frontier AI model in 2026?
DeepSeek V3.2 is the cheapest at $0.28 per million input tokens and $0.42 per million output tokens as of February 2026. Among proprietary models, Grok 4.1 is most affordable at $0.20 per million input and $0.50 per million output.
Which AI model is fastest in 2026?
Grok 4.1 is the fastest frontier model at 118 tokens per second as of February 2026. Gemini 3.1 Pro is second at 91 to 110 tokens per second. Claude Opus 4.6 outputs at 67 to 72 tokens per second.
Is DeepSeek V3.2 open source?
Yes. DeepSeek V3.2 is the only fully open-weight frontier model in this comparison, released under an MIT license. It can be self-hosted on your own infrastructure.
How does Claude Opus 4.6 compare to GPT-5.3 Codex?
Claude Opus 4.6 and GPT-5.3 Codex score nearly identically on SWE-bench Verified: 80.8% vs ~80%. Opus 4.6 leads on computer-use, agentic tasks, and human preference. GPT-5.3 Codex leads on autonomous coding and math. Opus 4.6 costs $25 per million output tokens versus approximately $14 for GPT-5.3.

David Cyrus

Founder, Attainment


Attainment is an AI-powered growth and operations firm based in Toronto. This report is produced independently with no paid placement or model provider sponsorship. All benchmark data is sourced from official model cards and third-party evaluations as of February 27, 2026.


† Gemini MRCR 1M uses a different evaluation variant.   † Grok 4.1 AA Index is for the Fast (Reasoning) variant.

N/R = not run.   N/A = not applicable.   ~ = estimated.   ★ = category leader.

Sources: Anthropic, Google DeepMind, OpenAI, xAI, DeepSeek model cards; Artificial Analysis Intelligence Index v4.0; LMSYS Chatbot Arena; Vals AI; Digital Applied LLM Comparison. All scores as of Feb 27, 2026.