Evel Knievel clears 14 Greyhound buses at Kings Island, October 25, 1975. Photo: Cincinnati Enquirer
Fourteen Buses and Fourteen Models
On October 25, 1975, Evel Knievel pointed his Harley-Davidson XR-750 at a row of fourteen Greyhound buses in an Ohio parking lot, hit 95 miles per hour, and sailed 133 feet through cold, drizzly air. Half the nation watched on ABC. It was his longest successful jump — a record that stood for 24 years — and he barely made it, his rear wheel clipping the roof of the final bus.
I think about that jump a lot these days. Not because I'm nostalgic for the 1970s, though at 83 I've earned the right to be. I think about it because the artificial intelligence industry in 2025 looked a lot like Knievel lining up for that fourteenth bus: moving at tremendous speed, adding obstacles faster than anyone can safely clear them, and hoping the landing gear holds.
Consider the numbers. Between January 2025 and February 2026, OpenAI, Anthropic, and Google collectively released at least fourteen major model versions. GPT-4.5 in February. Gemini 2.5 Pro in March. Claude Sonnet 4 and Opus 4 in May. GPT-5 in August. Claude Sonnet 4.5 in September. Gemini 3 Pro in November. Claude Opus 4.5 a week later. GPT-5.2 two weeks after that. Gemini 3 Flash five days later.
The pace is exhilarating. It is also, I believe, the single greatest argument for why no enterprise, no government agency, and no individual should trust any single AI model to deliver consistently reliable information.
The 88% Problem
According to Menlo Ventures' 2025 State of Generative AI report, OpenAI, Anthropic, and Google together command 88% of enterprise LLM API usage. The remaining 12% is scattered among Meta's Llama, Cohere, Mistral, and a long tail of smaller providers. Enterprise spending on generative AI hit $37 billion in 2025, up 3.2 times from the prior year.
But here's what most people miss: the distribution of that 88% shifted radically. In 2023, OpenAI held 50% of enterprise LLM spend. By 2025, it had fallen to 27%. Anthropic surged from 12% to 40%. Google tripled from 7% to 21%. In the coding market specifically, Anthropic now commands 54% share.
These aren't gentle market rotations. These are tectonic shifts happening in months, not years. And they reflect something important: each new model release doesn't just add capability — it reshuffles which model is best at what.
What Fifty Years of Risk Assessment Taught Me
I came to this problem not from Silicon Valley venture circles but from an actuarial background. I spent decades quantifying risk for insurance companies and benefits platforms. Actuaries are professional skeptics. We don't trust single data points. We triangulate. We cross-reference. We build redundancy into every calculation because we know that any individual estimate, no matter how sophisticated, carries embedded error.
When I first encountered ChatGPT in late 2023, my actuarial instincts fired immediately. Here was a system that delivered answers with extraordinary confidence and no disclosed margin of error. It didn't say "I'm 73% sure about this." It said "Here's the answer," and sometimes the answer was completely fabricated.
That experience led me to build Hallucinations.cloud and the H-LLM Multi-Model platform. The core idea is simple, borrowed directly from actuarial science: if you want to know whether an answer is reliable, don't ask one oracle. Ask eight.
The Multi-Model Thesis
The H-LLM platform simultaneously queries eight AI models with the same prompt and compares their responses. When models agree, confidence is high. When they diverge, the system flags the discrepancy and generates an H-Score — a reliability rating that tells users how much trust to place in any given answer.
This approach works precisely because of the competitive dynamics described above. Because OpenAI, Anthropic, and Google train their models on different data, with different architectures, different safety frameworks, and different optimization targets, their failure modes are largely independent.
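The agreement-scoring idea can be sketched in a few lines. This is an illustrative toy, not the actual H-Score methodology: the normalization, the majority-vote scoring, and the example responses are all assumptions made for the sake of the sketch.

```python
from collections import Counter


def h_score(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and a 0-1 agreement score.

    The score is the fraction of models that gave the majority answer:
    1.0 means all models agree; a score near 1/len(answers) means the
    models diverged and the answer should be flagged for review.
    """
    # Crude normalization so trivially different phrasings still match.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    top_answer, top_votes = counts.most_common(1)[0]
    return top_answer, top_votes / len(answers)


# Eight hypothetical model responses to the same factual prompt.
responses = [
    "Paris", "paris", "paris", "Paris",
    "paris", "paris", "Lyon", "paris",
]
answer, score = h_score(responses)
print(answer, score)  # paris 0.875
```

The reason this crude vote is informative at all is the independence point above: if each model hallucinated independently with, say, a 10% error rate, the chance of even a majority of eight converging on the same wrong answer would be vanishingly small. Correlated training data erodes that guarantee, which is why provider diversity matters as much as provider count.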
The Knievel Parallel
Evel Knievel's career offers a surprisingly apt metaphor for where AI stands today. His jumps got longer and more spectacular over time, but his safety engineering didn't keep pace. He relied on speed, courage, and a prayer that the physics would work out.
The AI industry is adding buses at an astonishing rate. Fourteen major model releases in just over a year. Each one faster, more capable, more impressive than the last. But the safety infrastructure — the mechanisms for verifying whether these models are actually telling the truth — hasn't scaled at the same pace.
AI needs its Robbie Maddison moment. Maddison, the stunt rider whose record-shattering jumps landed because the engineering finally kept pace with the ambition, represents what Knievel never had. The models themselves are improving at a breathtaking pace. What's missing is the verification infrastructure — the equivalent of a properly engineered landing ramp — that lets organizations deploy these models with confidence that the information they produce is actually reliable.
What Comes Next
The AI arms race isn't slowing down. If anything, the pace is accelerating. The question isn't whether these models will get more powerful. They will. The question is whether we'll build the verification systems that let society use that power safely.
Evel Knievel's back wheel clipped that fourteenth bus. He held on, kept the bike upright, and rolled to a stop. The crowd went wild. But watching the replay, you can see how close it was to disaster.
The AI industry is clipping the fourteenth bus right now. The question is whether we land it.