When a company releases a brand-new AI video generator, it's not long before someone uses it to make a video of actor Will Smith eating spaghetti.
It's become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.
Google's Veo 2 has done it.
We are finally eating spaghetti. pic.twitter.com/AZO81w8JC0
— Jerrod Lew (@jerrod_lew) December 17, 2024
Will Smith and pasta is but one of several bizarre "unofficial" benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect 4 against one another.
It's not as though there aren't more academic tests of an AI's performance. So why did the weirder ones blow up?
For one, many industry-standard AI benchmarks don't tell the average person very much. Companies often cite their AI's ability to answer questions from Math Olympiad exams, or to figure out plausible solutions to PhD-level problems. Yet most people, yours truly included, use chatbots for things like responding to emails and basic research.
Crowdsourced industry measures aren't necessarily better or more informative.
Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative (most come from AI and tech industry circles) and cast their votes based on personal, hard-to-pin-down preferences.
Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.
"The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless," Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it'll generate, say, a burger well.
One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its abilities in narrow domains. That's sensible. But I have a feeling that weird benchmarks aren't going away anytime soon. Not only are they entertaining (who doesn't like watching AI build Minecraft castles?) but they're easy to understand. And as my colleague Max Zeff wrote recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.
The only question in my mind is: which odd new benchmarks will go viral in 2025?