Model Benchmarks · May 28, 2025 · 14 min read

GPT-4o vs Claude 3.5 Sonnet vs Mistral Large 2:
200K Tokens of Real Enterprise Workloads

Academic benchmarks are marketing. MMLU scores don't predict your support ticket quality. HumanEval doesn't predict your TypeScript coverage. We ran real workloads. The results break several assumptions.

Methodology note

All tests were run May 12–22, 2025, against production API endpoints (no cached responses). Each model received identical prompts across four workload categories. Three human evaluators scored quality in blind review. Cost was calculated from actual token usage via TokenFin instrumentation, not estimated.

Models and current pricing

Model	Input / 1M tokens	Output / 1M tokens	Context
GPT-4o (2024-05-13)	$2.50	$10.00	128K
Claude 3.5 Sonnet	$3.00	$15.00	200K
Mistral Large 2	$2.00	$6.00	128K
Gemini 1.5 Pro	$3.50	$10.50	1M

Note: Claude 3.5 Sonnet costs 2.5× Mistral Large 2 on output. This difference matters enormously at scale: a feature generating 10B output tokens a year pays $150K (Claude) vs $60K (Mistral). That $90K gap funds an engineer.
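A minimal sketch of the arithmetic behind that comparison, using the output list prices from the table above. The helper name and the 10-billion-token volume are illustrative assumptions; at these prices, that is the volume at which output spend reaches $150K on Claude and $60K on Mistral. Input tokens are ignored here.

```typescript
// Output list prices in USD per 1M tokens, copied from the pricing table.
const outputPricePerMillion = new Map<string, number>([
  ["GPT-4o", 10.0],
  ["Claude 3.5 Sonnet", 15.0],
  ["Mistral Large 2", 6.0],
  ["Gemini 1.5 Pro", 10.5],
]);

// Hypothetical helper: output-token spend only, at list price.
function outputCostUsd(model: string, outputTokens: number): number {
  const price = outputPricePerMillion.get(model);
  if (price === undefined) {
    throw new Error(`No price on file for model: ${model}`);
  }
  return (outputTokens / 1e6) * price;
}

const volume = 10e9; // 10 billion output tokens (illustrative)
const claudeCost = outputCostUsd("Claude 3.5 Sonnet", volume); // 150000
const mistralCost = outputCostUsd("Mistral Large 2", volume); // 60000
const gap = claudeCost - mistralCost; // 90000

console.log({ claudeCost, mistralCost, gap });
```

Because the cost is linear in volume, the 2.5× ratio holds at any scale; only the absolute gap changes.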

Test workloads

Code generation

50K tokens

Python FastAPI endpoints, TypeScript React components, SQL query optimisation, unit test generation. Evaluated on: correctness (does it run?), idiomaticity, edge case handling.

Document summarisation

60K tokens

Legal contracts (avg 8K tokens), technical RFPs, financial quarterly reports. Evaluated on: factual accuracy, key point coverage, no hallucinations.

Structured extraction

50K tokens

JSON extraction from unstructured text — nested schemas, optional fields, type coercion. Evaluated on: schema compliance rate, field accuracy, graceful handling of missing data.

Multi-turn support chat

40K tokens

Customer support conversations, 8–14 turns each. Topics: billing, technical troubleshooting, account management. Evaluated on: resolution rate, tone, escalation accuracy.

Overall results

Model	Cost/1K output	p50 latency	p95 latency	Quality	Value score
GPT-4o	$0.041	820ms	2,100ms	8.9/10	7.2/10
Claude 3.5 Sonnet (best quality)	$0.027	640ms	1,400ms	9.2/10	8.8/10
Mistral Large 2	$0.018	480ms	980ms	8.4/10	9.1/10
Gemini 1.5 Pro	$0.021	590ms	1,200ms	8.6/10	8.5/10

Value score = quality / (cost × latency factor). Mistral wins on pure value. Claude wins on quality. GPT-4o is the most expensive for what you get.

Breakdown by workload category

01 — Code Generation

Model	Correctness	Idiomatic	Edge cases	Overall
GPT-4o	94%	8.8/10	7.9/10	9.0/10
Claude 3.5 Sonnet	96%	9.4/10	9.1/10	9.4/10
Mistral Large 2	89%	8.2/10	7.4/10	8.3/10
Gemini 1.5 Pro	91%	8.5/10	7.7/10	8.6/10

Finding: Claude 3.5 Sonnet wins convincingly on code. Its edge case handling is particularly strong — it consistently wrote guard clauses and null checks that GPT-4o missed. For TypeScript specifically, Claude generated more type-safe code with 23% fewer type errors on first run.

02 — Document Summarisation

Model	Factual accuracy	Key point coverage	Hallucination rate	Overall
GPT-4o	91%	88%	2.1%	8.8/10
Claude 3.5 Sonnet	94%	93%	0.8%	9.3/10
Mistral Large 2	87%	83%	3.4%	8.2/10
Gemini 1.5 Pro	90%	87%	2.8%	8.7/10

Finding: Claude's 0.8% hallucination rate vs Mistral's 3.4% is the critical differentiator for legal and financial documents. In a 1,000-document pipeline, Mistral produces ~34 hallucinated summaries vs Claude's ~8. For compliance-sensitive workloads, that gap is not acceptable. GPT-4o's 2.1% is defensible but Claude leads.

03 — Structured JSON Extraction

Model	Schema compliance	Field accuracy	Nested objects	Null handling
GPT-4o	98.2%	96.4%	94.1%	97.8%
Claude 3.5 Sonnet	97.9%	97.1%	95.3%	98.4%
Mistral Large 2	93.4%	91.2%	86.7%	92.1%
Gemini 1.5 Pro	95.8%	94.3%	91.2%	95.6%

Finding: GPT-4o and Claude are effectively tied on structured extraction, within half a point of each other on schema compliance. Both clearly outperform Mistral on nested objects (94.1% and 95.3% vs 86.7%). This is where Mistral's cost advantage evaporates: broken extractions require retries, which erode the cost saving and add latency. For complex schemas, use GPT-4o or Claude. For flat schemas, Mistral is fine.
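One way to reason about the retry overhead is sketched below. It assumes, as a simplification not taken from the benchmark itself, that a failed extraction is retried until it passes and that each attempt succeeds independently with the same probability. Under that assumption the expected number of attempts is 1 / success rate. The function name is hypothetical, and the nested-object figures from the table stand in for per-attempt success rates.

```typescript
// Expected spend per successful extraction when failures are retried.
// Assumes independent attempts with a fixed success probability, so the
// number of attempts follows a geometric distribution with mean 1 / successRate.
function effectiveCostPerSuccess(costPerAttempt: number, successRate: number): number {
  if (!(successRate > 0 && successRate <= 1)) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  return costPerAttempt / successRate;
}

// Retry multipliers using the nested-object accuracy column as the success rate.
const mistralMultiplier = effectiveCostPerSuccess(1, 0.867); // roughly 1.15
const gpt4oMultiplier = effectiveCostPerSuccess(1, 0.941); // roughly 1.06

console.log({ mistralMultiplier, gpt4oMultiplier });
```

The multiplier covers retry spend only. It does not account for the added latency of each retry or for bad extractions that pass validation undetected.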

04 — Multi-turn Customer Support

Model	Resolution rate	Tone score	Context retention	Escalation accuracy
GPT-4o	78%	8.7/10	96%	91%
Claude 3.5 Sonnet	84%	9.1/10	98%	94%
Mistral Large 2	71%	8.3/10	91%	86%
Gemini 1.5 Pro	74%	8.5/10	93%	88%

Finding: Claude's 84% resolution rate vs Mistral's 71% is a 13-point gap. In a support system handling 10,000 tickets/month, that's 1,300 additional tickets that need human escalation with Mistral. At $15/ticket average human handling cost, that's $19,500/month ($234K/year) in hidden cost, well above the $90K/year difference in model cost at that volume.
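The hidden-cost estimate can be reproduced with a short sketch. It follows the same simplification as the finding: every ticket the model fails to resolve goes to a human at a flat per-ticket cost. The interface and function names are illustrative.

```typescript
interface SupportScenario {
  ticketsPerMonth: number;
  higherResolutionRate: number; // e.g. 0.84 for Claude 3.5 Sonnet
  lowerResolutionRate: number; // e.g. 0.71 for Mistral Large 2
  humanCostPerTicketUsd: number;
}

// Extra human-handled tickets and their cost when using the weaker model.
function extraEscalationCost(s: SupportScenario) {
  const rateGap = s.higherResolutionRate - s.lowerResolutionRate;
  // Round to whole tickets; this also absorbs floating-point noise in the gap.
  const extraTickets = Math.round(s.ticketsPerMonth * rateGap);
  const monthlyUsd = extraTickets * s.humanCostPerTicketUsd;
  return { extraTickets, monthlyUsd, annualUsd: monthlyUsd * 12 };
}

const supportEstimate = extraEscalationCost({
  ticketsPerMonth: 10_000,
  higherResolutionRate: 0.84,
  lowerResolutionRate: 0.71,
  humanCostPerTicketUsd: 15,
});

console.log(supportEstimate);
// { extraTickets: 1300, monthlyUsd: 19500, annualUsd: 234000 }
```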

The architecture recommendation

Single-model architectures are a relic of 2023. The correct answer in 2025 is a task-router pattern:

// Task-router pattern: production architecture
const router = {
  code_generation: 'claude-3-5-sonnet',     // highest quality, worth the cost
  legal_summarisation: 'claude-3-5-sonnet', // hallucination rate critical
  json_extraction: 'gpt-4o',                // tied with Claude, cheaper for structured
  batch_summarisation: 'mistral-large-2',   // nightly jobs, cost wins
  internal_tools: 'gpt-4o-mini',            // no revenue impact, minimize cost
  dev_staging: 'gpt-4o-mini',               // always cheap in non-prod
};
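A lookup table on its own leaves open what happens to unknown task names and non-production traffic. The sketch below is one possible way to wrap the router in a typed resolver. The `resolveModel` function, the `Environment` type, and the choice of `gpt-4o-mini` as the fallback are assumptions for illustration, not part of the original architecture.

```typescript
type Task =
  | "code_generation"
  | "legal_summarisation"
  | "json_extraction"
  | "batch_summarisation"
  | "internal_tools"
  | "dev_staging";

type Environment = "prod" | "staging" | "dev";

// Same routes as the router above, typed so a missing task is a compile error.
const router: Record<Task, string> = {
  code_generation: "claude-3-5-sonnet",
  legal_summarisation: "claude-3-5-sonnet",
  json_extraction: "gpt-4o",
  batch_summarisation: "mistral-large-2",
  internal_tools: "gpt-4o-mini",
  dev_staging: "gpt-4o-mini",
};

// Assumed fallback for task names the router does not know about.
const FALLBACK_MODEL = "gpt-4o-mini";

// Type guard: checks own keys only, so "toString" and similar never match.
function isTask(value: string): value is Task {
  return Object.prototype.hasOwnProperty.call(router, value);
}

// Non-prod traffic always takes the dev_staging route; prod looks up the task.
function resolveModel(task: string, env: Environment = "prod"): string {
  if (env !== "prod") {
    return router.dev_staging;
  }
  return isTask(task) ? router[task] : FALLBACK_MODEL;
}

console.log(resolveModel("code_generation")); // claude-3-5-sonnet
console.log(resolveModel("code_generation", "staging")); // gpt-4o-mini
console.log(resolveModel("unknown_task")); // gpt-4o-mini
```

Falling back to the cheapest model is a cost-first default. A team that prefers to fail loudly on unrouted tasks could throw instead.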

Teams that implement a task router and track cost per route via attribution typically reduce total LLM spend by 35–50% compared to single-model architectures, with no quality regression on revenue-generating features.
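Per-route attribution can be as simple as tagging each request with its route and summing cost from token counts. The following is a minimal sketch under that assumption. It is not TokenFin's API; the `UsageEvent` shape and `costByRoute` function are hypothetical, and the prices are the list prices from the table earlier in this post.

```typescript
interface ModelPrice {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

interface UsageEvent {
  route: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// List prices from the pricing table; extend with any other models in use.
const listPrices = new Map<string, ModelPrice>([
  ["gpt-4o", { inputPerMillion: 2.5, outputPerMillion: 10.0 }],
  ["claude-3-5-sonnet", { inputPerMillion: 3.0, outputPerMillion: 15.0 }],
  ["mistral-large-2", { inputPerMillion: 2.0, outputPerMillion: 6.0 }],
]);

// Sums USD cost per route across a batch of usage events.
function costByRoute(events: UsageEvent[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    const price = listPrices.get(e.model);
    if (!price) {
      throw new Error(`No price on file for model: ${e.model}`);
    }
    const cost =
      (e.inputTokens * price.inputPerMillion +
        e.outputTokens * price.outputPerMillion) / 1e6;
    totals.set(e.route, (totals.get(e.route) ?? 0) + cost);
  }
  return totals;
}

const sampleTotals = costByRoute([
  { route: "code_generation", model: "claude-3-5-sonnet", inputTokens: 1e6, outputTokens: 1e6 },
  { route: "code_generation", model: "claude-3-5-sonnet", inputTokens: 1e6, outputTokens: 0 },
  { route: "json_extraction", model: "gpt-4o", inputTokens: 2e6, outputTokens: 1e6 },
]);

console.log(Object.fromEntries(sampleTotals));
// { code_generation: 21, json_extraction: 15 }
```

With totals keyed by route, it becomes straightforward to see which routes justify a premium model and which can be moved to a cheaper one.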

TL;DR

🏆 Best quality: Claude 3.5 Sonnet — wins code, docs, support, hallucination rate

💰 Best value: Mistral Large 2 (60% cheaper than Claude on output list price; acceptable for batch/internal)

⚡ Fastest: Mistral at 480ms p50 — 41% faster than GPT-4o

🏗️ Best architecture: Task router — right model for each job, tracked per route

❌ Avoid: Single-model architecture at scale — you are overpaying on some routes, underserving on others
