
Frontier AI Model Report

March 2026

8 frontier models compared across reasoning, coding, agentic tasks, multimodal, speed, and pricing. For executives, founders, and operators.

Claude Opus 4.6
Claude Sonnet 4.6
Gemini 3.1 Pro
GPT-5.4
Grok 4.20 Beta
DeepSeek V3.2
MiniMax M2.7
GLM-5

What This Report Covers

Side-by-side benchmark results for 8 frontier AI models across reasoning, coding, agentic tasks, multimodal, long context, price, and speed. This edition adds GPT-5.4, Grok 4.20 Beta, MiniMax M2.7, and GLM-5 to the comparison. Sourced from official model cards and third-party evaluations as of March 20, 2026.

Bottom Line

No single model wins every category. GPT-5.4 is the biggest mover, surpassing human experts on OSWorld. Gemini 3.1 Pro holds the reasoning crown. Claude Opus 4.6 leads human preference and agentic work. MiniMax M2.7 and GLM-5 prove frontier intelligence is no longer exclusive to Western labs or closed models.

What Changed This Month

NEW: GPT-5.4 surpasses human experts

First model to beat the human baseline on OSWorld (75.0% vs 72.4%). 1.05M context window. Replaces GPT-5.3 Codex in this report.

NEW: Grok 4.20 Beta hits 212+ tokens/sec

Nearly 2x the speed of any competitor. Record 78% non-hallucination rate. 2M-token context. Replaces Grok 4.1.

NEW: MiniMax M2.7 enters the report

AA Intelligence Index 50 at $1.20/M output. A self-evolving model that handled 30-50% of its own RL research.

NEW: GLM-5 enters as open-source leader

First open-weight model to hit AA Index 50. 744B/40B MoE, MIT license. Trained entirely on Huawei chips.

Quick Answers

Best for reasoning: Gemini 3.1 Pro (94.3% GPQA Diamond)
Best for coding: GPT-5.4 (75.0% OSWorld, beats humans)
Best for agentic tasks: Claude Opus 4.6 (53.0% HLE with Tools)
Best value: MiniMax M2.7 (AA Index 50, $1.20/M output)

Section 01

Which AI Model Has the Best Reasoning?

Gemini 3.1 Pro leads PhD-level science and abstract reasoning in March 2026, scoring highest on GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). GPT-5.4 is a strong second at 92.8% GPQA Diamond and 73.3% ARC-AGI-2. On Humanity's Last Exam with tools, Claude Opus 4.6 leads at 53.0%.

GPQA Diamond (PhD-level science)

Claude Opus 4.6: 91.3%
Claude Sonnet 4.6: 74.1%
Gemini 3.1 Pro: 94.3% ★
GPT-5.4: 92.8%
Grok 4.20 Beta: ~88.5%
DeepSeek V3.2: 79.9%
MiniMax M2.7: ~88.6%
GLM-5: 86%

ARC-AGI-2 (novel abstract reasoning)

Claude Opus 4.6: 68.8%
Claude Sonnet 4.6: 60.4%
Gemini 3.1 Pro: 77.1% ★
GPT-5.4: 73.3%
Grok 4.20 Beta: N/R
DeepSeek V3.2: N/R
MiniMax M2.7: N/R
GLM-5: N/R

Humanity's Last Exam (no tools)

Claude Opus 4.6: 40%
Claude Sonnet 4.6: 19.1%
Gemini 3.1 Pro: 44.4% ★
GPT-5.4: 42%
Grok 4.20 Beta: ~44.4%
DeepSeek V3.2: N/A
MiniMax M2.7: ~28%
GLM-5: 30.5%

HLE with Tools

Claude Opus 4.6: 53.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 51.4%
GPT-5.4: 52.1%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 50.4%

AIME 2025 (competition math)

Claude Opus 4.6: ~82%
Claude Sonnet 4.6: ~70%
Gemini 3.1 Pro: 95%
GPT-5.4: 100% ★
Grok 4.20 Beta: ~100%
DeepSeek V3.2: 96%
MiniMax M2.7: N/R
GLM-5: N/R

MMMLU (multilingual knowledge)

Claude Opus 4.6: 91.1%
Claude Sonnet 4.6: 79.1%
Gemini 3.1 Pro: 92.6% ★
GPT-5.4: ~88.5%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 02

Which AI Model is Best for Coding and Software Engineering?

GPT-5.4 is the new coding leader in March 2026, surpassing human experts on OSWorld (75.0% vs 72.4% human baseline) and leading SWE-bench Pro at 57.7%. Claude Opus 4.6 holds the top SWE-bench Verified score at 80.8%. GLM-5, the open-source leader, scores 77.8% on SWE-bench Verified.

SWE-bench Verified (real GitHub bugs)

Claude Opus 4.6: 80.8% ★
Claude Sonnet 4.6: 79.6%
Gemini 3.1 Pro: 80.6%
GPT-5.4: ~77%
Grok 4.20 Beta: ~71%
DeepSeek V3.2: 67.8%
MiniMax M2.7: N/R
GLM-5: 77.8%

SWE-bench Pro (harder tasks)

Claude Opus 4.6: N/A
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 54.2%
GPT-5.4: 57.7% ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: 56.2%
GLM-5: N/A

Terminal-Bench 2.0 (agentic terminal)

Claude Opus 4.6: 65.4%
Claude Sonnet 4.6: 59.1%
Gemini 3.1 Pro: 68.5%
GPT-5.4: 75.1% ★
Grok 4.20 Beta: N/R
DeepSeek V3.2: 31.3%
MiniMax M2.7: 57%
GLM-5: 56.2%

OSWorld-Verified (computer-use agent)

Claude Opus 4.6: 72.7%
Claude Sonnet 4.6: 72.5%
Gemini 3.1 Pro: N/R
GPT-5.4: 75.0% ★
Grok 4.20 Beta: N/R
DeepSeek V3.2: N/R
MiniMax M2.7: N/R
GLM-5: N/R

LiveCodeBench (competitive programming)

Claude Opus 4.6: ~62%
Claude Sonnet 4.6: ~58%
Gemini 3.1 Pro: 75.6% ★
GPT-5.4: 72.5%
Grok 4.20 Beta: ~77%
DeepSeek V3.2: 74.1%
MiniMax M2.7: N/R
GLM-5: N/R

Section 03

Which AI Model is Best for Agentic Tasks and Tool Use?

GPT-5.4 leads office productivity with a GDPval-AA score of 1,667 and web research at 82.7% on BrowseComp. Claude Opus 4.6 leads customer service agent tasks (93.5% tau2-bench Retail). GLM-5 brings open-source agentic capabilities with 89.7% on tau2-bench.


tau2-bench Retail (customer service agent)

Claude Opus 4.6: 93.5% ★
Claude Sonnet 4.6: 91.7%
Gemini 3.1 Pro: 90.8%
GPT-5.4: N/R
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 89.7%

GDPval-AA (office productivity)

Claude Opus 4.6: 1,606 Elo
Claude Sonnet 4.6: 1,633 Elo
Gemini 3.1 Pro: 1,317 Elo
GPT-5.4: 1,667 Elo ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: 1,495 Elo
GLM-5: N/A

BrowseComp (web research agent)

Claude Opus 4.6: 84%
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 85.9% ★
GPT-5.4: 82.7%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: 62%

Finance Agent

Claude Opus 4.6: 60.7%
Claude Sonnet 4.6: 63.3% ★
Gemini 3.1 Pro: N/A
GPT-5.4: N/A
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 04

Which AI Model Has the Best Multimodal and Long Context Support?

GPT-5.4 leads vision reasoning at 90.6% on MMMU-Pro (Pro variant). Claude Opus 4.6 leads extreme 1-million-token context at 76.0% on MRCR v2. DeepSeek V3.2 and GLM-5 are text-only with no vision support. MiniMax M2.7 scores 83.5% on MMMU.

MMMU-Pro (vision + reasoning)

Claude Opus 4.6: 73.9%
Claude Sonnet 4.6: ~68%
Gemini 3.1 Pro: 80.5%
GPT-5.4: 90.6%† ★
Grok 4.20 Beta: N/A
DeepSeek V3.2: text-only
MiniMax M2.7: ~83.5%
GLM-5: text-only

MRCR v2 128K (long-context retrieval)

Claude Opus 4.6: 84.9% ★
Claude Sonnet 4.6: ~81%
Gemini 3.1 Pro: 84.9% ★
GPT-5.4: ~85%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

MRCR v2 1M tokens (extreme long context)

Claude Opus 4.6: 76.0% ★
Claude Sonnet 4.6: N/A
Gemini 3.1 Pro: 26.3%†
GPT-5.4: ~37%
Grok 4.20 Beta: N/A
DeepSeek V3.2: N/A
MiniMax M2.7: N/A
GLM-5: N/A

Section 05

Context Windows, API Access, and Release Dates

Full specifications for all eight frontier models as of March 20, 2026, including pricing, output speed, context window size, and the Artificial Analysis Intelligence Index composite score.

Spec | Opus 4.6 | Sonnet 4.6 | Gemini 3.1 | GPT-5.4 | Grok 4.20 | DeepSeek | MiniMax | GLM-5
Released | Feb 5 '26 | Feb 17 '26 | Feb 19 '26 | Mar 5 '26 | Mar 10 '26 | Dec 1 '25 | Mar 18 '26 | Feb 11 '26
Context window | 200K (1M beta) | 200K (1M beta) | 1M std | 1.05M | 2M | 128K | ~200K | 200K
Input $/1M tokens | $5.00 | $3.00 | $2.00 | $2.50 | $2.00 | $0.28 | $0.30 | ~$1.00
Output $/1M tokens | $25.00 | $15.00 | $12.00 | $15.00 | $6.00 | $0.42 | $1.20 | ~$3.20
Speed (tokens/sec) | 67-72 | 54-56 | ~120 | ~82 | 212+ | 49 | ~49 | ~85
Arena Elo | ~1,501 | Top 5 | ~1,493 | ~1,480† | ~1,493 | N/R | 1,407† | 1,455
AA Intelligence Index | 53 | 52 | 57 | 57 | 48 | 32 | 50 | 50
Open weights | No | No | No | No | No | MIT ✓ | No | MIT ✓

Section 06

Which Frontier AI Model Offers the Best Value?

Output cost and inference speed determine which model is practical at scale. Grok 4.20 Beta leads speed at 212+ tokens per second. DeepSeek V3.2 leads cost at $0.42 per million output tokens. MiniMax M2.7 offers the best balance of intelligence and price.

Output Price: $/1M Tokens

Claude Opus 4.6: $25.00
Claude Sonnet 4.6: $15.00
Gemini 3.1 Pro: $12.00
GPT-5.4: $15.00
Grok 4.20 Beta: $6.00
DeepSeek V3.2: $0.42
MiniMax M2.7: $1.20
GLM-5: $3.20

Speed: Tokens per Second

Claude Opus 4.6: 67-72 t/s
Claude Sonnet 4.6: 54-56 t/s
Gemini 3.1 Pro: ~120 t/s
GPT-5.4: ~82 t/s
Grok 4.20 Beta: 212+ t/s
DeepSeek V3.2: 49 t/s
MiniMax M2.7: ~49 t/s
GLM-5: ~85 t/s
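
To translate per-token prices into a budget, here is a minimal back-of-the-envelope sketch in Python. The monthly token volumes are hypothetical workload assumptions, not figures from this report; the prices come from the Section 05 table:

    # Cost = (tokens / 1_000_000) * price per 1M tokens.
    # Monthly volumes below are hypothetical assumptions for illustration.
    INPUT_TOKENS = 200_000_000   # assumed input tokens per month
    OUTPUT_TOKENS = 50_000_000   # assumed output tokens per month

    # (input $/1M, output $/1M), from the Section 05 spec table.
    pricing = {
        "Claude Opus 4.6": (5.00, 25.00),
        "Gemini 3.1 Pro": (2.00, 12.00),
        "Grok 4.20 Beta": (2.00, 6.00),
        "MiniMax M2.7": (0.30, 1.20),
        "DeepSeek V3.2": (0.28, 0.42),
    }

    for model, (in_price, out_price) in pricing.items():
        monthly = INPUT_TOKENS / 1e6 * in_price + OUTPUT_TOKENS / 1e6 * out_price
        print(f"{model:<16} ${monthly:>8,.0f}/month")

Under these assumed volumes, Claude Opus 4.6 comes to roughly $2,250 a month against roughly $77 for DeepSeek V3.2, which is the spread the Price-Performance Index in Section 07 tries to capture.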

Section 07

Where Each Model Wins

Claude Opus 4.6

Best for complex agentic work

  • SWE-bench Verified #1 (80.8%)
  • HLE with Tools leader (53.0%)
  • tau2-bench Retail (93.5%)
  • Arena Elo #1 (~1,501)
  • 1M-token long context (76.0%)

Claude Sonnet 4.6

Best production value (Anthropic)

  • Finance agent leader (63.3%)
  • Near-Opus quality at 60% cost
  • 79.6% SWE-bench Verified
  • OSWorld 72.5%

Gemini 3.1 Pro

Best reasoning + price-performance

  • ARC-AGI-2 leader (77.1%)
  • GPQA Diamond leader (94.3%)
  • HLE no-tools co-leader (44.4%)
  • AA Intelligence Index co-leader (57)
  • BrowseComp leader (85.9%)
  • 1M context standard

GPT-5.4

Best for autonomous tasks

  • OSWorld leader (75.0%, beats humans)
  • SWE-bench Pro leader (57.7%)
  • Terminal-Bench 2.0 leader (75.1%)
  • AIME 2025 perfect (100%)
  • GDPval-AA #1 (1,667)
  • AA Intelligence Index co-leader (57)

Grok 4.20 Beta

Fastest frontier model

  • 212+ tokens/sec (nearly 2x the speed of any rival)
  • Largest context window (2M tokens)
  • Non-hallucination record (78%)
  • $6/M output (about a quarter of Opus 4.6's price)

DeepSeek V3.2

Cheapest API

  • $0.42/M output (cheapest frontier API)
  • AIME 2025 (96.0%)
  • LiveCodeBench (74.1%)
  • MIT open weights
  • Self-hostable infrastructure

MiniMax M2.7

Best value frontier intelligence

  • AA Index 50 at $1.20/M output
  • Self-evolving (30-50% own RL research)
  • SWE-bench Pro (56.2%)
  • GDPval-AA Elo (1,495)
  • 41.7 pts/$ price-performance

GLM-5

Best open-source model

  • AA Index 50 (#1 open-weight)
  • SWE-bench Verified 77.8% (#1 open)
  • HLE with Tools 50.4%
  • 744B/40B MoE, MIT license
  • Trained on Huawei Ascend chips, zero NVIDIA dependency

Original Analysis

Price-Performance Index

Attainment's Price-Performance Index divides each model's AA Intelligence Index score by its output cost per million tokens. Higher is better. MiniMax M2.7 enters the report at 41.7 pts/$, the second-best ratio behind DeepSeek V3.2.

Model | AA Index | Output $/1M | Price-Performance
DeepSeek V3.2 | 32 | $0.42 | 76.2 pts/$
MiniMax M2.7 | 50 | $1.20 | 41.7 pts/$
GLM-5 | 50 | $3.20 | 15.6 pts/$
Grok 4.20 Beta | 48 | $6.00 | 8.0 pts/$
Gemini 3.1 Pro | 57 | $12.00 | 4.8 pts/$
GPT-5.4 | 57 | $15.00 | 3.8 pts/$
Claude Sonnet 4.6 | 52 | $15.00 | 3.5 pts/$
Claude Opus 4.6 | 53 | $25.00 | 2.1 pts/$

Price-Performance = AA Intelligence Index divided by output cost per million tokens. Higher score means more intelligence per dollar. Calculated by Attainment, March 2026.
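
The calculation is simple enough to reproduce. Here is a minimal Python sketch of the same arithmetic, using the AA Index scores and output prices from the table above:

    # Price-Performance Index = AA Intelligence Index / output $ per 1M tokens.
    # Scores and prices are the March 2026 values from the table above.
    models = {
        "DeepSeek V3.2": (32, 0.42),
        "MiniMax M2.7": (50, 1.20),
        "GLM-5": (50, 3.20),
        "Grok 4.20 Beta": (48, 6.00),
        "Gemini 3.1 Pro": (57, 12.00),
        "GPT-5.4": (57, 15.00),
        "Claude Sonnet 4.6": (52, 15.00),
        "Claude Opus 4.6": (53, 25.00),
    }

    ranked = sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (aa_index, out_price) in ranked:
        print(f"{name:<18} {aa_index / out_price:5.1f} pts/$")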

Methodology

How We Scored These Models

Benchmark scores were collected from official model cards, provider documentation, and independent evaluation platforms. All data reflects published results as of March 20, 2026. We selected benchmarks that measure distinct capabilities: PhD-level science knowledge (GPQA Diamond), abstract reasoning (ARC-AGI-2), real-world software engineering (SWE-bench), autonomous computing (OSWorld), and vision reasoning (MMMU-Pro). The AA Intelligence Index from Artificial Analysis provides a composite score normalized across evaluations. This is the second edition in a monthly series.
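
The report does not spell out how Artificial Analysis normalizes its composite, so as a purely illustrative sketch (not AA's actual method), here is one generic way a composite index can be built: min-max normalize each benchmark to a 0-100 scale, then average across benchmarks. Scores below are taken from Sections 01 and 02.

    # Toy composite: min-max normalize each benchmark, then average.
    # NOT Artificial Analysis's published method; illustration only.
    scores = {
        # model: (GPQA Diamond, SWE-bench Verified, Terminal-Bench 2.0)
        "Gemini 3.1 Pro": (94.3, 80.6, 68.5),
        "GPT-5.4": (92.8, 77.0, 75.1),
        "DeepSeek V3.2": (79.9, 67.8, 31.3),
    }

    def normalize(column):
        lo, hi = min(column), max(column)
        return [100 * (x - lo) / (hi - lo) for x in column]

    columns = list(zip(*scores.values()))                  # one tuple per benchmark
    normed = list(zip(*(normalize(c) for c in columns)))   # back to per-model rows
    for model, row in zip(scores, normed):
        print(f"{model:<16} composite = {sum(row) / len(row):5.1f}")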


Summary

Key Takeaways

  1. GPT-5.4 is the biggest mover this month. It surpasses human experts on OSWorld (75.0% vs 72.4%) and leads SWE-bench Pro (57.7%), Terminal-Bench 2.0 (75.1%), and GDPval-AA (1,667). Its AA Intelligence Index ties Gemini at 57.

  2. Gemini 3.1 Pro holds the reasoning crown: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, AA Intelligence Index 57. At $12/M output, it remains the best reasoning-per-dollar among top-tier models.

  3. Claude Opus 4.6 retains the human preference lead (Arena Elo ~1,501) and leads agentic benchmarks: HLE with Tools 53.0%, tau2-bench Retail 93.5%, SWE-bench Verified 80.8%.

  4. MiniMax M2.7 is the value story of the month. AA Intelligence Index 50 at $1.20/M output gives it 41.7 pts/$ price-performance. It autonomously handled 30-50% of its own reinforcement learning research.

  5. GLM-5 is the first open-weight model to hit 50 on the AA Intelligence Index. SWE-bench Verified 77.8%, HLE with Tools 50.4%, MIT license. Trained entirely on Huawei Ascend chips.

  6. Grok 4.20 Beta is the fastest at 212+ tokens/sec with the largest context (2M tokens). Its 78% non-hallucination rate sets a new record. At $6/M output, it trades raw intelligence for speed and honesty.

  7. Claude Sonnet 4.6 is the quiet workhorse: Finance Agent leader (63.3%), OSWorld 72.5%, SWE-bench Verified 79.6%, all at 60% of Opus pricing.

  8. DeepSeek V3.2 remains the cheapest API at $0.42/M output, with strong math (AIME 96.0%) and coding (LiveCodeBench 74.1%). V4 is expected in April.

FAQ

Frequently Asked Questions

Which AI model is best for reasoning in March 2026?
Gemini 3.1 Pro leads reasoning benchmarks, scoring 94.3% on GPQA Diamond, 77.1% on ARC-AGI-2, and 44.4% on Humanity's Last Exam without tools. GPT-5.4 is a strong second at 92.8% GPQA Diamond and 73.3% ARC-AGI-2.
Which AI model is best for coding in March 2026?
Claude Opus 4.6 leads SWE-bench Verified at 80.8%. GPT-5.4 leads the harder SWE-bench Pro at 57.7% and Terminal-Bench 2.0 at 75.1%. GPT-5.4 also became the first model to surpass human experts on OSWorld (75.0% vs 72.4% human baseline).
What is the fastest AI model in March 2026?
Grok 4.20 Beta is the fastest at 212 to 232 tokens per second. Gemini 3.1 Pro is second at approximately 120 tokens per second. GLM-5 runs at 85 tokens per second via third-party providers.
What is the cheapest frontier AI model in March 2026?
DeepSeek V3.2 is cheapest at $0.42 per million output tokens. MiniMax M2.7 offers the best value at frontier intelligence: AA Intelligence Index 50 at $1.20 per million output. GLM-5 (MIT open-weight, AA Index 50) is available at roughly $3.20 per million output.
What is MiniMax M2.7?
MiniMax M2.7 is a proprietary reasoning model from Chinese AI lab MiniMax, released March 18, 2026. It uses a 230B-parameter mixture-of-experts architecture with roughly 10B active parameters per token. It scores 50 on the AA Intelligence Index at $1.20 per million output, and autonomously handled 30 to 50% of its own reinforcement learning research during training.
What is GLM-5?
GLM-5 is an open-weight model from Zhipu AI (February 2026, MIT license). It has 744B total parameters with 40B active. It is the first open-weight model to score 50 on the AA Intelligence Index. It was trained on 100,000 Huawei Ascend chips with zero NVIDIA dependency.
How does GPT-5.4 compare to GPT-5.3 Codex?
GPT-5.4 improves on GPT-5.3 Codex across the board. OSWorld jumps from 64.7% to 75.0%. SWE-bench Pro rises from 56.8% to 57.7%. The context window expands to 1.05 million tokens. AA Intelligence Index goes from N/R to 57 (tied for #1). Output pricing is $15 per million versus $14 for GPT-5.3.
Which frontier AI model has the largest context window?
Grok 4.20 Beta leads at 2 million tokens. GPT-5.4 offers 1.05 million. Gemini 3.1 Pro offers 1 million standard. Claude Opus 4.6 and Sonnet 4.6 support 200K with 1 million in beta.

David Cyrus

Founder, Attainment


Attainment is an AI-powered growth and operations firm based in Toronto. This report is produced independently with no paid placement or model provider sponsorship. All benchmark data is sourced from official model cards and third-party evaluations as of March 20, 2026. This is the second edition in a monthly series.

Published March 2026 · Previous edition: February 2026

† GPT-5.4 MMMU-Pro score is from the Pro variant ($30/M input). GPT-5.4 and MiniMax M2.7 Arena Elo scores are preliminary.

N/R = not run.   N/A = not applicable.   ~ = estimated.   ★ = category leader.

Sources: Anthropic, Google DeepMind, OpenAI, xAI, DeepSeek, MiniMax, Zhipu AI model cards; Artificial Analysis Intelligence Index v4.0; Arena.ai (formerly LMSYS); Vals AI; Digital Applied LLM Comparison. All scores as of March 20, 2026.