Technology

Researchers tested 8 AI models against Premier League betting markets – All lost money, and 6 went bust

Ryan Brothwell 11 April 2026 4 min read

Every single one of the eight frontier AI models tested lost money betting on the English Premier League, and six of them went completely bust, wiping out their simulated £100,000 bankrolls.

The study, conducted by General Reasoning, Inc., was called KellyBench, a rigorous long-horizon evaluation environment designed to test AI agents’ ability to make sequential decisions under uncertainty.

Unlike typical AI benchmarks focused on short, isolated tasks, KellyBench simulates an entire 2023-24 Premier League season, forcing models to build predictive models, spot value in betting odds, size bets appropriately, manage risk, and adapt as new data arrives week after week.

Each AI agent started with a normalised £100,000 bankroll and had access to detailed historical data, advanced statistics, lineups, past results, and public betting odds.

Over the course of 100–150 matchdays, the models were required to alternate between developing machine learning models, placing bets, and analysing outcomes, often consuming hundreds of tool calls and tens to hundreds of millions of tokens per full-season simulation.

The results were unambiguous. All eight models lost money on average across three independent runs. Six of the eight models went bust (reached £0) in at least one run, and several did so repeatedly.

Even the best models struggled

Top performer Claude Opus 4.6 posted the least painful average return of -11.0%, ending the season with a mean bankroll of £89,035. It was the only model to show a positive return in its best individual run (+21.5%), though it still lost money overall.

GPT-5.4 came in a close second with a -13.6% average ROI and a final mean bankroll of £86,365. These two were the only models that avoided complete ruin across all three seeds.

The rest of the field fared far worse:

Gemini 3.1 Pro: -43.3% ROI
Gemini 3.1 Flash Lite PREVIEW: -58.4% ROI
GLM-5: -58.8% ROI
Kimi K2.5: -68.3% ROI
Arcee Trinity (Trinity-Large): -84.2% ROI
Grok 4.20: -88.2% ROI

Several models, including Grok 4.20, Arcee Trinity, and Kimi K2.5, frequently went to zero, highlighting a critical failure in long-term capital preservation.

Why did the AIs fail

According to the researchers, current frontier models struggle with coherent behaviour over long time horizons.

Many failed to consistently act on their own analyses, adapt strategies as the season progressed and team forms shifted, or implement disciplined risk management.

A custom “sophistication” metric, based on a 44-point rubric developed with quantitative betting experts, revealed that even the strongest models scored below one-third of the maximum points.

Criteria included proper use of Kelly Criterion-style staking (fractional bankroll sizing based on edge), dynamic team strength modelling, and handling of non-stationary factors like promoted teams.

Higher sophistication scores correlated with better returns and lower bankruptcy rates, but no model came close to human-level strategic depth in this domain.

The two strongest models, Claude Opus 4.6 and GPT-5.4, stood out for three key behaviors:

Retraining or adjusting strategies in response to new match data
Using systematic staking rules rather than arbitrary bet sizes
Preserving capital during periods when they identified no clear edge

KellyBench was built on the Open Reward Standard (ORS) and is now available as an open-access API endpoint.

The researchers argue that existing AI evaluations fall short because they don’t capture the challenges of real-world, open-ended decision-making in non-stationary environments.

“Existing evaluations do not measure capabilities in long-horizon, non-stationary environments with open-ended goals,” the team noted. “KellyBench is an early example of the type of environment we need as AI systems become more agentic.”

Running the benchmark is computationally expensive: GPT-5.4 cost roughly $1,571 per full episode on average, while Claude Opus 4.6 ran at about $969. Cheaper open models completed episodes for $30–$40.

A full research paper detailing the environment, methodology, and analysis is available on the General Reasoning website.