
Frontier AI Model Report

March 2026

8 frontier models compared across reasoning, coding, agentic tasks, multimodal, speed, and pricing. For executives, founders, and operators.

Claude Opus 4.6
Claude Sonnet 4.6
Gemini 3.1 Pro
GPT-5.4
Grok 4.20 Beta
DeepSeek V3.2
MiniMax M2.7
GLM-5

What This Report Covers

Side-by-side benchmark results for 8 frontier AI models across reasoning, coding, agentic tasks, multimodal, long context, price, and speed. This edition adds GPT-5.4, Grok 4.20 Beta, MiniMax M2.7, and GLM-5 to the comparison. Sourced from official model cards and third-party evaluations as of March 20, 2026.

Bottom Line

No single model wins every category. GPT-5.4 is the biggest mover, surpassing human experts on OSWorld. Gemini 3.1 Pro holds the reasoning crown. Claude Opus 4.6 leads human preference and agentic work. MiniMax M2.7 and GLM-5 prove frontier intelligence is no longer exclusive to Western labs or closed models.

What Changed This Month

NEW: GPT-5.4 surpasses human experts

First model to beat the human baseline on OSWorld (75.0% vs 72.4%). 1.05M context window. Replaces GPT-5.3 Codex in this report.

NEW: Grok 4.20 Beta hits 212+ tokens/sec

Nearly 2x the speed of any competitor. Record 78% non-hallucination rate. 2M-token context. Replaces Grok 4.1.

NEW: MiniMax M2.7 enters the report

AA Intelligence Index 50 at $1.20/M output. A self-evolving model that handled 30-50% of its own RL research.

NEW: GLM-5 enters as open-source leader

First open-weight model to hit AA Index 50. 744B/40B MoE, MIT license. Trained entirely on Huawei chips.

Quick Answers

Best for reasoning: Gemini 3.1 Pro (94.3% GPQA Diamond)
Best for coding: GPT-5.4 (75.0% OSWorld, beats humans)
Best for agentic tasks: Claude Opus 4.6 (53.0% HLE with Tools)
Best value: MiniMax M2.7 (AA Index 50, $1.20/M output)

Section 01

Which AI Model Has the Best Reasoning?

Gemini 3.1 Pro leads PhD-level science and abstract reasoning in March 2026, scoring highest on GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). GPT-5.4 is a strong second at 92.8% GPQA Diamond and 73.3% ARC-AGI-2. On Humanity's Last Exam with tools, Claude Opus 4.6 leads at 53.0%.

GPQA Diamond (PhD-level science)

Claude Opus 4.6: 91.3%
Claude Sonnet 4.6: 74.1%
Gemini 3.1 Pro: 94.3% ★
GPT-5.4: 92.8%
Grok 4.20 Beta: ~88.5%
DeepSeek V3.2: 79.9%
MiniMax M2.7: ~88.6%
GLM-5: 86%

ARC-AGI-2 (novel abstract reasoning)

Claude Opus 4.6: 68.8%
Claude Sonnet 4.6: 60.4%
Gemini 3.1 Pro: 77.1% ★
GPT-5.4: 73.3%
Grok 4.20 Beta: N/R
DeepSeek V3.2: N/R
MiniMax M2.7: N/R
GLM-5: N/R

Humanity's Last Exam (no tools)

Claude Opus 4.6: 40%
Claude Sonnet 4.6: 19.1%
Gemini 3.1 Pro: 44.4% ★
GPT-5.4: 42%
Grok 4.20 Beta: ~44.4%
DeepSeek V3.2: N/A
MiniMax M2.7: ~28%
GLM-5: 30.5%

HLE with Tools

Claude Opus 4.6: 53.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 51.4%
GPT-5.4: 52.1%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 50.4%

AIME 2025 (competition math)

Claude Opus 4.6: ~82%
Claude Sonnet 4.6: ~70%
Gemini 3.1 Pro: 95%
GPT-5.4: 100% ★
Grok 4.20 Beta: ~100%
DeepSeek V3.2: 96%
MiniMax M2.7: N/R
GLM-5: N/R

MMMLU (multilingual knowledge)

Claude Opus 4.6: 91.1%
Claude Sonnet 4.6: 79.1%
Gemini 3.1 Pro: 92.6% ★
GPT-5.4: ~88.5%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 02

Which AI Model is Best for Coding and Software Engineering?

GPT-5.4 is the new coding leader in March 2026, surpassing human experts on OSWorld (75.0% vs 72.4% human baseline) and leading SWE-bench Pro at 57.7%. Claude Opus 4.6 holds the top SWE-bench Verified score at 80.8%. GLM-5, the open-source leader, scores 77.8% on SWE-bench Verified.

SWE-bench Verified (real GitHub bugs)

Claude Opus 4.6: 80.8% ★
Claude Sonnet 4.6: 79.6%
Gemini 3.1 Pro: 80.6%
GPT-5.4: ~77%
Grok 4.20 Beta: ~71%
DeepSeek V3.2: 67.8%
MiniMax M2.7: N/R
GLM-5: 77.8%

SWE-bench Pro (harder tasks)

Claude Opus 4.6: N/A
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 54.2%
GPT-5.4: 57.7% ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: 56.2%
GLM-5: N/A

Terminal-Bench 2.0 (agentic terminal)

Claude Opus 4.6: 65.4%
Claude Sonnet 4.6: 59.1%
Gemini 3.1 Pro: 68.5%
GPT-5.4: 75.1% ★
Grok 4.20 Beta: N/R
DeepSeek V3.2: 31.3%
MiniMax M2.7: 57%
GLM-5: 56.2%

OSWorld-Verified (computer-use agent)

Claude Opus 4.6: 72.7%
Claude Sonnet 4.6: 72.5%
Gemini 3.1 Pro: N/R
GPT-5.4: 75.0% ★
Grok 4.20 Beta: N/R
DeepSeek V3.2: N/R
MiniMax M2.7: N/R
GLM-5: N/R

LiveCodeBench (competitive programming)

Claude Opus 4.6: ~62%
Claude Sonnet 4.6: ~58%
Gemini 3.1 Pro: 75.6% ★
GPT-5.4: 72.5%
Grok 4.20 Beta: ~77%
DeepSeek V3.2: 74.1%
MiniMax M2.7: N/R
GLM-5: N/R

Section 03

Which AI Model is Best for Agentic Tasks and Tool Use?

GPT-5.4 leads office productivity with a GDPval-AA score of 1,667 and web research at 82.7% on BrowseComp. Claude Opus 4.6 leads customer service agent tasks (93.5% tau2-bench Retail). GLM-5 brings open-source agentic capabilities with 89.7% on tau2-bench.


tau2-bench Retail (customer service agent)

Claude Opus 4.6: 93.5% ★
Claude Sonnet 4.6: 91.7%
Gemini 3.1 Pro: 90.8%
GPT-5.4: N/R
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 89.7%

GDPval-AA (office productivity)

Claude Opus 4.6: 1,606 Elo
Claude Sonnet 4.6: 1,633 Elo
Gemini 3.1 Pro: 1,317 Elo
GPT-5.4: 1,667 Elo ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: 1,495 Elo
GLM-5: N/A

BrowseComp (web research agent)

Claude Opus 4.6: 84%
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 85.9% ★
GPT-5.4: 82.7%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 62%

Finance Agent

Claude Opus 4.6: 60.7%
Claude Sonnet 4.6: 63.3% ★
Gemini 3.1 Pro: N/A
GPT-5.4: N/A
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 04

Which AI Model Has the Best Multimodal and Long Context Support?

GPT-5.4 leads vision reasoning at 90.6% on MMMU-Pro (Pro variant). Claude Opus 4.6 leads extreme 1-million-token context at 76.0% on MRCR v2. DeepSeek V3.2 and GLM-5 are text-only with no vision support. MiniMax M2.7 scores 83.5% on MMMU.

MMMU-Pro (vision + reasoning)

Claude Opus 4.6: 73.9%
Claude Sonnet 4.6: ~68%
Gemini 3.1 Pro: 80.5%
GPT-5.4: 90.6%† ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: text-only
MiniMax M2.7: ~83.5%
GLM-5: text-only

MRCR v2 128K (long-context retrieval)

Claude Opus 4.6: 84.9% ★
Claude Sonnet 4.6: ~81%
Gemini 3.1 Pro: 84.9% ★
GPT-5.4: ~85%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

MRCR v2 1M tokens (extreme long context)

Claude Opus 4.6: 76.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 26.3%†
GPT-5.4: ~37%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 05

Context Windows, API Access, and Release Dates

Full specifications for all eight frontier models as of March 20, 2026, including pricing, output speed, context window size, and the Artificial Analysis Intelligence Index composite score.

Spec | Opus 4.6 | Sonnet 4.6 | Gemini 3.1 | GPT-5.4 | Grok 4.20 | DeepSeek | MiniMax | GLM-5
Released | Feb 5 '26 | Feb 17 '26 | Feb 19 '26 | Mar 5 '26 | Mar 10 '26 | Dec 1 '25 | Mar 18 '26 | Feb 11 '26
Context window | 200K (1M beta) | 200K (1M beta) | 1M std | 1.05M | 2M | 128K | ~200K | 200K
Input $/1M tokens | $5.00 | $3.00 | $2.00 | $2.50 | $2.00 | $0.28 | $0.30 | ~$1.00
Output $/1M tokens | $25.00 | $15.00 | $12.00 | $15.00 | $6.00 | $0.42 | $1.20 | ~$3.20
Speed (tokens/sec) | 67-72 | 54-56 | ~120 | ~82 | 212+ | 49 | ~49 | ~85
Arena Elo | ~1,501 | Top 5 | ~1,493 | ~1,480† | ~1,493 | N/R | 1,407† | 1,455
AA Intelligence Index | 53 | 52 | 57 | 57 | 48 | 32 | 50 | 50
Open weights | No | No | No | No | No | MIT ✓ | No | MIT ✓

Section 06

Which Frontier AI Model Offers the Best Value?

Output cost and inference speed determine which model is practical at scale. Grok 4.20 Beta leads speed at 212+ tokens per second. DeepSeek V3.2 leads cost at $0.42 per million output tokens. MiniMax M2.7 offers the best balance of intelligence and price.

Output Price: $/1M Tokens

Claude Opus 4.6: $25.00
Claude Sonnet 4.6: $15.00
Gemini 3.1 Pro: $12.00
GPT-5.4: $15.00
Grok 4.20 Beta: $6.00
DeepSeek V3.2: $0.42
MiniMax M2.7: $1.20
GLM-5: $3.20

Speed: Tokens per Second

Claude Opus 4.6: 67-72 t/s
Claude Sonnet 4.6: 54-56 t/s
Gemini 3.1 Pro: ~120 t/s
GPT-5.4: ~82 t/s
Grok 4.20 Beta: 212+ t/s
DeepSeek V3.2: 49 t/s
MiniMax M2.7: ~49 t/s
GLM-5: ~85 t/s
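
To translate per-token prices into a budget, here is a minimal back-of-the-envelope sketch in Python. The monthly token volumes are hypothetical workload assumptions, not figures from this report; the prices come from the Section 05 table:

    # Cost = (tokens / 1_000_000) * price per 1M tokens.
    # Monthly volumes below are hypothetical assumptions for illustration.
    INPUT_TOKENS = 200_000_000   # assumed input tokens per month
    OUTPUT_TOKENS = 50_000_000   # assumed output tokens per month

    # (input $/1M, output $/1M), from the Section 05 spec table.
    pricing = {
        "Claude Opus 4.6": (5.00, 25.00),
        "Gemini 3.1 Pro": (2.00, 12.00),
        "Grok 4.20 Beta": (2.00, 6.00),
        "MiniMax M2.7": (0.30, 1.20),
        "DeepSeek V3.2": (0.28, 0.42),
    }

    for model, (in_price, out_price) in pricing.items():
        monthly = INPUT_TOKENS / 1e6 * in_price + OUTPUT_TOKENS / 1e6 * out_price
        print(f"{model:<16} ${monthly:>8,.0f}/month")

Under these assumed volumes, Claude Opus 4.6 comes to roughly $2,250 a month against roughly $77 for DeepSeek V3.2, which is the spread the Price-Performance Index in Section 07 tries to capture.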

Section 07

Where Each Model Wins

Claude Opus 4.6

Best for complex agentic work

  • SWE-bench Verified #1 (80.8%)
  • HLE with Tools leader (53.0%)
  • tau2-bench Retail (93.5%)
  • Arena Elo #1 (~1,501)
  • 1M-token long context (76.0%)

Claude Sonnet 4.6

Best production value (Anthropic)

  • Finance agent leader (63.3%)
  • Near-Opus quality at 60% cost
  • 79.6% SWE-bench Verified
  • OSWorld 72.5%

Gemini 3.1 Pro

Best reasoning + price-performance

  • ARC-AGI-2 leader (77.1%)
  • GPQA Diamond leader (94.3%)
  • HLE no-tools co-leader (44.4%)
  • AA Intelligence Index co-leader (57)
  • BrowseComp leader (85.9%)
  • 1M context standard

GPT-5.4

Best for autonomous tasks

  • OSWorld leader (75.0%, beats humans)
  • SWE-bench Pro leader (57.7%)
  • Terminal-Bench 2.0 leader (75.1%)
  • AIME 2025 perfect (100%)
  • GDPval-AA #1 (1,667)
  • AA Intelligence Index co-leader (57)

Grok 4.20 Beta

Fastest frontier model

  • 212+ tokens/sec (nearly 2x the speed of any rival)
  • Largest context window (2M tokens)
  • Non-hallucination record (78%)
  • $6/M output (about a quarter of Opus 4.6's price)

DeepSeek V3.2

Cheapest API

  • $0.42/M output (cheapest frontier API)
  • AIME 2025 (96.0%)
  • LiveCodeBench (74.1%)
  • MIT open weights
  • Self-hostable infrastructure

MiniMax M2.7

Best value frontier intelligence

  • AA Index 50 at $1.20/M output
  • Self-evolving (30-50% own RL research)
  • SWE-bench Pro (56.2%)
  • GDPval-AA Elo (1,495)
  • 41.7 pts/$ price-performance

GLM-5

Best open-source model

  • AA Index 50 (#1 open-weight)
  • SWE-bench Verified 77.8% (#1 open)
  • HLE with Tools 50.4%
  • 744B/40B MoE, MIT license
  • Trained on Huawei Ascend chips, zero NVIDIA dependency

Original Analysis

Price-Performance Index

Attainment's Price-Performance Index divides each model's AA Intelligence Index score by its output cost per million tokens. Higher is better. MiniMax M2.7 enters the report at 41.7 pts/$, the second-best ratio behind DeepSeek V3.2.

Model | AA Index | Output $/1M | Price-Performance
DeepSeek V3.2 | 32 | $0.42 | 76.2 pts/$
MiniMax M2.7 | 50 | $1.20 | 41.7 pts/$
GLM-5 | 50 | $3.20 | 15.6 pts/$
Grok 4.20 Beta | 48 | $6.00 | 8.0 pts/$
Gemini 3.1 Pro | 57 | $12.00 | 4.8 pts/$
GPT-5.4 | 57 | $15.00 | 3.8 pts/$
Claude Sonnet 4.6 | 52 | $15.00 | 3.5 pts/$
Claude Opus 4.6 | 53 | $25.00 | 2.1 pts/$

Price-Performance = AA Intelligence Index divided by output cost per million tokens. Higher score means more intelligence per dollar. Calculated by Attainment, March 2026.
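
The calculation is simple enough to reproduce. Here is a minimal Python sketch of the same arithmetic, using the AA Index scores and output prices from the table above:

    # Price-Performance Index = AA Intelligence Index / output $ per 1M tokens.
    # Scores and prices are the March 2026 values from the table above.
    models = {
        "DeepSeek V3.2": (32, 0.42),
        "MiniMax M2.7": (50, 1.20),
        "GLM-5": (50, 3.20),
        "Grok 4.20 Beta": (48, 6.00),
        "Gemini 3.1 Pro": (57, 12.00),
        "GPT-5.4": (57, 15.00),
        "Claude Sonnet 4.6": (52, 15.00),
        "Claude Opus 4.6": (53, 25.00),
    }

    ranked = sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (aa_index, out_price) in ranked:
        print(f"{name:<18} {aa_index / out_price:5.1f} pts/$")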

Methodology

How We Scored These Models

Benchmark scores were collected from official model cards, provider documentation, and independent evaluation platforms. All data reflects published results as of March 20, 2026. We selected benchmarks that measure distinct capabilities: PhD-level science knowledge (GPQA Diamond), abstract reasoning (ARC-AGI-2), real-world software engineering (SWE-bench), autonomous computing (OSWorld), and vision reasoning (MMMU-Pro). The AA Intelligence Index from Artificial Analysis provides a composite score normalized across evaluations. This is the second edition in a monthly series.
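
The report does not spell out how Artificial Analysis normalizes its composite, so as a purely illustrative sketch (not AA's actual method), here is one generic way a composite index can be built: min-max normalize each benchmark to a 0-100 scale, then average across benchmarks. Scores below are taken from Sections 01 and 02.

    # Toy composite: min-max normalize each benchmark, then average.
    # NOT Artificial Analysis's published method; illustration only.
    scores = {
        # model: (GPQA Diamond, SWE-bench Verified, Terminal-Bench 2.0)
        "Gemini 3.1 Pro": (94.3, 80.6, 68.5),
        "GPT-5.4": (92.8, 77.0, 75.1),
        "DeepSeek V3.2": (79.9, 67.8, 31.3),
    }

    def normalize(column):
        lo, hi = min(column), max(column)
        return [100 * (x - lo) / (hi - lo) for x in column]

    columns = list(zip(*scores.values()))                  # one tuple per benchmark
    normed = list(zip(*(normalize(c) for c in columns)))   # back to per-model rows
    for model, row in zip(scores, normed):
        print(f"{model:<16} composite = {sum(row) / len(row):5.1f}")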


Summary

Key Takeaways

  1. GPT-5.4 is the biggest mover this month. It surpasses human experts on OSWorld (75.0% vs 72.4%) and leads SWE-bench Pro (57.7%), Terminal-Bench 2.0 (75.1%), and GDPval-AA (1,667). Its AA Intelligence Index ties Gemini at 57.

  2. Gemini 3.1 Pro holds the reasoning crown: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, AA Intelligence Index 57. At $12/M output, it remains the best reasoning-per-dollar among top-tier models.

  3. Claude Opus 4.6 retains the human preference lead (Arena Elo ~1,501) and leads agentic benchmarks: HLE with Tools 53.0%, tau2-bench Retail 93.5%, SWE-bench Verified 80.8%.

  4. MiniMax M2.7 is the value story of the month. AA Intelligence Index 50 at $1.20/M output gives it 41.7 pts/$ price-performance. It autonomously handled 30-50% of its own reinforcement learning research.

  5. GLM-5 is the first open-weight model to hit 50 on the AA Intelligence Index. SWE-bench Verified 77.8%, HLE with Tools 50.4%, MIT license. Trained entirely on Huawei Ascend chips.

  6. Grok 4.20 Beta is the fastest at 212+ tokens/sec with the largest context (2M tokens). Its 78% non-hallucination rate sets a new record. At $6/M output, it trades raw intelligence for speed and honesty.

  7. Claude Sonnet 4.6 is the quiet workhorse: Finance Agent leader (63.3%), OSWorld 72.5%, SWE-bench Verified 79.6%, all at 60% of Opus pricing.

  8. DeepSeek V3.2 remains the cheapest API at $0.42/M output, with strong math (AIME 96.0%) and coding (LiveCodeBench 74.1%). V4 is expected in April.

FAQ

Frequently Asked Questions

Which AI model is best for reasoning in March 2026?
Gemini 3.1 Pro leads reasoning benchmarks, scoring 94.3% on GPQA Diamond, 77.1% on ARC-AGI-2, and 44.4% on Humanity's Last Exam without tools. GPT-5.4 is a strong second at 92.8% GPQA Diamond and 73.3% ARC-AGI-2.
Which AI model is best for coding in March 2026?
Claude Opus 4.6 leads SWE-bench Verified at 80.8%. GPT-5.4 leads the harder SWE-bench Pro at 57.7% and Terminal-Bench 2.0 at 75.1%. GPT-5.4 also became the first model to surpass human experts on OSWorld (75.0% vs 72.4% human baseline).
What is the fastest AI model in March 2026?
Grok 4.20 Beta is the fastest at 212 to 232 tokens per second. Gemini 3.1 Pro is second at approximately 120 tokens per second. GLM-5 runs at 85 tokens per second via third-party providers.
What is the cheapest frontier AI model in March 2026?
DeepSeek V3.2 is cheapest at $0.42 per million output tokens. MiniMax M2.7 offers the best value at frontier intelligence: AA Intelligence Index 50 at $1.20 per million output. GLM-5 (MIT open-weight, AA Index 50) is available at roughly $3.20 per million output.
What is MiniMax M2.7?
MiniMax M2.7 is a proprietary reasoning model from Chinese AI lab MiniMax, released March 18, 2026. It uses a 230B-parameter mixture-of-experts architecture with roughly 10B active parameters per token. It scores 50 on the AA Intelligence Index at $1.20 per million output, and autonomously handled 30 to 50% of its own reinforcement learning research during training.
What is GLM-5?
GLM-5 is an open-weight model from Zhipu AI (February 2026, MIT license). It has 744B total parameters with 40B active. It is the first open-weight model to score 50 on the AA Intelligence Index. It was trained on 100,000 Huawei Ascend chips with zero NVIDIA dependency.
How does GPT-5.4 compare to GPT-5.3 Codex?
GPT-5.4 improves on GPT-5.3 Codex across the board. OSWorld jumps from 64.7% to 75.0%. SWE-bench Pro rises from 56.8% to 57.7%. The context window expands to 1.05 million tokens. AA Intelligence Index goes from N/R to 57 (tied for #1). Output pricing is $15 per million versus $14 for GPT-5.3.
Which frontier AI model has the largest context window?
Grok 4.20 Beta leads at 2 million tokens. GPT-5.4 offers 1.05 million. Gemini 3.1 Pro offers 1 million standard. Claude Opus 4.6 and Sonnet 4.6 support 200K with 1 million in beta.

David Cyrus

Founder, Attainment


Attainment is an AI-powered growth and operations firm based in Toronto. This report is produced independently with no paid placement or model provider sponsorship. All benchmark data is sourced from official model cards and third-party evaluations as of March 20, 2026. This is the second edition in a monthly series.

Published March 2026 · Previous edition: February 2026

† GPT-5.4 MMMU-Pro score is from the Pro variant ($30/M input). GPT-5.4 and MiniMax M2.7 Arena Elo scores are preliminary.

N/R = not run.   N/A = not applicable.   ~ = estimated.   ★ = category leader.

Sources: Anthropic, Google DeepMind, OpenAI, xAI, DeepSeek, MiniMax, Zhipu AI model cards; Artificial Analysis Intelligence Index v4.0; Arena.ai (formerly LMSYS); Vals AI; Digital Applied LLM Comparison. All scores as of March 20, 2026.