GPT-5.2 Tops GDPval-AA Benchmark but at a Steep Cost
By Administrator

OpenAI's GPT-5.2 has claimed the top spot on the GDPval-AA benchmark, surpassing Claude Opus 4.5, but its high operational cost raises questions about its economic viability.
OpenAI's latest model, GPT-5.2, has taken the lead on the GDPval-AA benchmark, a test designed to evaluate AI performance on real-world, economically valuable tasks in sectors such as finance and healthcare. According to Artificial Analysis, GPT-5.2 achieved the highest score, an Elo rating of 1474, overtaking Claude Opus 4.5. This milestone highlights the model's advanced agentic capabilities, supported by the Stirrup harness, which simulates shell access and web browsing.
However, the victory comes at a significant cost. Running GPT-5.2 on the 220-task benchmark cost $620, roughly seven times the $88 required for its predecessor, GPT-5.1. The increase is driven by GPT-5.2's use of 250 million tokens, about six times GPT-5.1's 40 million, combined with a 40% increase in OpenAI's API prices, to $1.75 per million input tokens and $14 per million output tokens. By comparison, Claude Opus 4.5 cost $608 to run, making GPT-5.2 narrowly the most expensive model to evaluate.
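The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above; the function name `cost_breakdown` is illustrative, and the "naive" estimate rests on an assumption that is not in the source: that the 40% price increase applies uniformly and that both models used the same mix of input and output tokens.

```python
# Figures as quoted in the article for the GDPval-AA runs.
GPT_5_2_COST_USD = 620
GPT_5_1_COST_USD = 88
GPT_5_2_TOKENS_MILLIONS = 250
GPT_5_1_TOKENS_MILLIONS = 40
PRICE_INCREASE = 0.40   # 40% higher API prices for GPT-5.2
TASKS = 220


def cost_breakdown():
    """Compare the reported cost ratio with a naive tokens-times-price estimate."""
    token_ratio = GPT_5_2_TOKENS_MILLIONS / GPT_5_1_TOKENS_MILLIONS
    price_ratio = 1 + PRICE_INCREASE
    return {
        "token_ratio": token_ratio,
        # Assumption: same input/output token mix for both models.
        "naive_cost_ratio": token_ratio * price_ratio,
        "reported_cost_ratio": GPT_5_2_COST_USD / GPT_5_1_COST_USD,
        "gpt_5_2_cost_per_task": GPT_5_2_COST_USD / TASKS,
        "gpt_5_1_cost_per_task": GPT_5_1_COST_USD / TASKS,
    }


for name, value in cost_breakdown().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the naive estimate (6.25 times the tokens at 1.4 times the price, or 8.75x) overshoots the reported cost ratio of about 7.05x. One possible reading is that the two runs differed in their split between cheaper input tokens and more expensive output tokens, but the quoted figures alone do not settle that. Per task, the reported totals work out to roughly $2.82 for GPT-5.2 against $0.40 for GPT-5.1.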
Example outputs from GDPval-AA tasks showcase GPT-5.2's enhanced creativity, producing detailed travel itineraries and music video moodboards that outshine GPT-5.1's offerings. Despite this, the high cost and resource demands may limit its practical adoption. Artificial Analysis is set to release a full update on its Intelligence Index benchmarks, which could further clarify GPT-5.2's position in the AI landscape.
This development underscores the ongoing race among AI developers to push performance boundaries, though it also raises questions about the balance between innovation and affordability.