New study exposes chatbots’ dangerous drift from facts.
A bombshell study published in a Royal Society journal reveals that 73% of AI-generated summaries of scientific research contain inaccuracies or critical omissions.
Researchers analysed nearly 5,000 summaries generated by leading chatbots, including ChatGPT-4o, DeepSeek, and LLaMA 3.3 70B, and found that newer models were more error-prone than older ones, contradicting industry promises of steady improvement.
Shockingly, ChatGPT-4o was nine times more likely to omit key details than its predecessor, while Meta’s LLaMA 3.3 70B overgeneralised 36 times more often than earlier versions.
The implications are dire. As AI infiltrates medicine, engineering, and education, flawed summaries could distort research, misguide professionals, and even endanger lives.
Humans instinctively sense when a generalisation holds and when it doesn’t, like knowing stoves burn but fridges don’t, while AI struggles with that kind of context. Yet, despite mounting evidence of unreliability, companies keep pushing chatbots into workplaces.
The study suggests prompt engineering may help, but for now, one truth remains: when accuracy matters, humans still reign supreme.
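For readers who want to try that idea themselves, the sketch below shows one possible accuracy-first prompt: a small Python helper that wraps a paper abstract in instructions asking the model to keep the qualifiers the study found chatbots tend to drop. The function name, prompt wording, and example abstract are illustrative assumptions, not the study’s protocol or any vendor’s recommended practice.

```python
# A minimal sketch of "accuracy-first" prompt engineering for research summaries.
# Everything here (function name, prompt wording, example text) is illustrative,
# not the study's method: the idea is simply to ask the model, explicitly, to
# preserve the hedges, sample sizes, and limitations it tends to drop.

def build_summary_prompt(abstract: str, max_sentences: int = 4) -> str:
    """Wrap a paper abstract in instructions that discourage overgeneralisation."""
    rules = [
        "State findings only for the population and conditions actually studied.",
        "Keep hedging language from the source (e.g. 'may', 'in this sample').",
        "Preserve sample sizes, effect sizes, and stated limitations.",
        "Do not convert past-tense findings into general present-tense claims.",
        "If a detail is not in the abstract, say it is not reported rather than guessing.",
    ]
    rules_text = "\n".join(f"- {rule}" for rule in rules)
    return (
        f"Summarise the abstract below in at most {max_sentences} sentences.\n"
        f"Follow these rules strictly:\n{rules_text}\n\n"
        f"Abstract:\n{abstract}"
    )


if __name__ == "__main__":
    example_abstract = (
        "In a randomised trial of 120 adults with mild insomnia, the intervention "
        "group reported modestly improved sleep quality after eight weeks."
    )
    # Print the assembled prompt; it can be sent to any chat model of your choice.
    print(build_summary_prompt(example_abstract))
```

Whether instructions like these actually reduce errors is something to verify against the original papers rather than assume; the prompt only nudges the model, it does not guarantee fidelity.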