Investigation
Same Questions. New Models. Mixed Results.
Large language models are often judged on complex benchmarks, but some of their most interesting failures show up on questions that seem trivial at first glance. In this article, we test a range of OpenAI models using a small set of deliberately easy questions, the kind that have a track