GPT-5 achieves 94.6% accuracy on AIME 2025, suggesting near-human mathematical reasoning. Yet ask it to query your database, and success rates plummet to the teens.
The Spider 2.0 benchmarks reveal a yawning gap in AI capabilities. Spider 2.0 is a comprehensive text-to-SQL benchmark that tests AI models’ ability to generate accurate SQL queries from natural language questions across real-world databases. While large language models have conquered knowledge work in mathematics, coding, and reasoning, text-to-SQL remains stubbornly difficult.
The three Spider 2.0 benchmarks test real-world database querying across different environments. Spider 2.0-Snow uses Snowflake databases with 547 test examples, peaking at 59.05% accuracy. Spider 2.0-Lite spans BigQuery, Snowflake, and SQLite with another 547 examples, reaching only 37.84%. Spider 2.0-DBT tests code generation against DuckDB with 68 examples, topping out at 39.71%.
This performance gap isn’t for lack of trying. Since November 2024, 56 submissions from 12 model families have competed on these benchmarks. Models from the Claude, OpenAI, and DeepSeek families, among others, have all been pushed against these tests. Progress has been steady, from roughly 2% to about 60% in the last nine months.
The puzzle deepens when you consider SQL’s constraints. SQL has a limited vocabulary compared to English, which has 600,000 words, or to programming languages, which have much broader syntaxes and libraries to learn. Plus there’s plenty of SQL out there to train on. If anything, this should be easier than the open-ended reasoning tasks where models now excel.
Yet even perfect SQL generation wouldn’t solve the real business challenge. Every company defines “revenue” differently. Marketing measures customer acquisition cost by campaign spend, sales calculates it using account executive costs, and finance includes fully-loaded employee expenses. These semantic differences create confusion that technical accuracy can’t resolve.
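To make the customer acquisition cost example concrete, here is a minimal Python sketch. Every figure and name in it is hypothetical (the `QuarterCosts` fields, the dollar amounts, the customer count), and I am assuming each team simply divides its own cost base by the same number of new customers. Real definitions are usually messier, but the point survives: three correct calculations, three different answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterCosts:
    """Hypothetical quarterly figures; none of these numbers come from real data."""
    campaign_spend: float       # marketing's cost base (assumed)
    ae_costs: float             # sales' cost base: account executive costs (assumed)
    fully_loaded_costs: float   # finance's cost base: fully-loaded employee expenses (assumed)
    new_customers: int


def marketing_cac(q: QuarterCosts) -> float:
    # Marketing's view: campaign spend per new customer.
    return q.campaign_spend / q.new_customers


def sales_cac(q: QuarterCosts) -> float:
    # Sales' view: account executive costs per new customer.
    return q.ae_costs / q.new_customers


def finance_cac(q: QuarterCosts) -> float:
    # Finance's view: fully-loaded employee expenses per new customer.
    return q.fully_loaded_costs / q.new_customers


quarter = QuarterCosts(
    campaign_spend=50_000.0,
    ae_costs=120_000.0,
    fully_loaded_costs=300_000.0,
    new_customers=100,
)

cac_by_team = {
    "marketing": marketing_cac(quarter),
    "sales": sales_cac(quarter),
    "finance": finance_cac(quarter),
}

for team, cac in cac_by_team.items():
    print(f"{team}: ${cac:,.2f} per new customer")
```

Each function is trivially correct, and a model could generate the equivalent SQL for any of them. What it can't know from the question "what is our CAC?" is which of the three the person asking has in mind.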
The Spider 2.0 results point to a fundamental truth about data work. Technical proficiency in SQL syntax is just the entry point. The real challenge lies in business context: understanding what the data means, how different teams define metrics, and when edge cases matter. As I wrote in Semantic Cultivators, the bridge between raw data and business meaning requires human judgment that current AI can’t replicate.