# Study Alleges LM Arena Gave Top AI Labs an Unfair Advantage in Benchmarking

A new study is raising serious questions about the impartiality of LM Arena, a popular platform used to benchmark large language models (LLMs). The paper, authored by researchers from Cohere, Stanford, MIT, and AI2, accuses LM Arena of providing preferential treatment to a select group of AI companies, including industry giants like Meta, OpenAI, Google, and Amazon.

According to the study, LM Arena allowed these favored companies to privately test multiple iterations of their AI models and selectively publish only the highest-performing scores. This practice, the authors argue, made it easier for these companies to dominate the Chatbot Arena leaderboard, a key resource for assessing AI model performance.

“Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others,” stated Sara Hooker, VP of AI research at Cohere and a co-author of the study, in an interview with TechCrunch. “This is gamification.”

Chatbot Arena, established in 2023 as a UC Berkeley research project, quickly became a widely adopted benchmark for AI companies. The platform pits two AI models against each other, presenting users with side-by-side responses and asking them to choose the better one. The aggregate votes determine a model’s score and its position on the leaderboard. While LM Arena has consistently maintained its impartiality, the new study challenges this claim.

The study alleges that Meta, for example, privately tested 27 model variants on Chatbot Arena leading up to the release of Llama 4. However, only the score of the top-ranking model was publicly disclosed.

In response to the accusations, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica told TechCrunch that the study contained “inaccuracies” and “questionable analysis.”

“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” LM Arena stated. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”

Armand Joulin, a principal researcher at Google DeepMind, also weighed in on X, disputing some of the study’s figures. He claimed that Google only submitted one Gemma 3 AI model for pre-release testing. Hooker responded on X, stating that the authors would issue a correction.

The researchers began their investigation in November 2024, driven by suspicions of preferential access to Chatbot Arena. They analyzed more than 2.8 million Chatbot Arena battles across a five-month period. Their findings suggest that certain companies were allowed to collect more data from Chatbot Arena by having their models participate in a higher number of battles, giving them an unfair advantage. The study claims this additional data could improve a model’s performance on Arena Hard, another LM Arena benchmark, by as much as 112%. LM Arena disputes that Arena Hard performance directly translates to Chatbot Arena performance.

Hooker acknowledges that the method used to identify models undergoing private testing, which relies on prompting the AI models to reveal their company of origin, isn’t foolproof. However, she notes that LM Arena did not dispute the study’s preliminary findings when contacted.

TechCrunch has reached out to Meta, Google, OpenAI, and Amazon for comment, but has not yet received a response.

The paper concludes with a call for greater transparency from LM Arena. The authors recommend implementing a clear limit on the number of private tests AI labs can conduct and publicly disclosing scores from these tests.

LM Arena has pushed back against these recommendations, stating that it already publishes information on pre-release testing and that it “makes no sense to show scores for pre-release models which are not publicly available.” However, the organization has expressed openness to adjusting Chatbot Arena’s sampling rate to ensure equal participation across all models.

The controversy follows criticism leveled at Meta earlier this month for optimizing a Llama 4 model specifically for “conversationality” to achieve a high score on Chatbot Arena, without releasing the optimized model to the public.

This new study intensifies scrutiny of privately operated benchmarking organizations, especially as LM Arena recently announced plans to launch as a company and raise capital. The question now is whether these organizations can truly be trusted to objectively assess AI models without being influenced by corporate interests.