The supposed Churchill quote “Don’t trust statistics that you haven’t falsified yourself” is itself an example of information manipulation. Although it is often attributed to the British Prime Minister, historical research points to a propagandistic origin in the Third Reich: Joseph Goebbels’ strategy for portraying Churchill as a liar included the instruction that “one should never use official apparatus to launch lies”, but should “always obfuscate the source of a lie immediately”.
Walter Krämer addressed this in his bestseller “How to Lie with Statistics”. He pointed out that it is less about deliberate falsification and more about the clever selection and presentation of data. This insight takes on new significance in the age of artificial intelligence, as AI systems such as ChatGPT, Google Gemini, and Claude are essentially nothing more than highly developed statistical models. They make predictions based on probabilities – similar to traditional statistics, only many times more complex.
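How literally “predictions based on probabilities” should be taken can be shown in a few lines of Python. The sketch below is a toy example with invented numbers, not the output of any real model: it only illustrates the core mechanism by which a language model turns raw scores into a probability distribution over possible next tokens and then samples from it.

```python
import math
import random

# Toy example: invented scores (logits) a model might assign to possible
# continuations of the phrase "Statistics can be ...".
vocabulary = ["misleading", "useful", "boring", "falsified"]
logits = [2.1, 1.3, 0.2, 1.7]

# The softmax function turns raw scores into a probability distribution.
exponentials = [math.exp(score) for score in logits]
total = sum(exponentials)
probabilities = [value / total for value in exponentials]

for token, prob in zip(vocabulary, probabilities):
    print(f"P(next token = {token!r}) = {prob:.2f}")

# The model then samples from this distribution: it produces what is
# probable given its training data, not what is necessarily true.
print("Sampled continuation:", random.choices(vocabulary, weights=probabilities)[0])
```

The point of the toy example is the same one the BBC findings below make empirically: plausibility and truth are not the same thing.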
The BBC study: alarming findings on AI accuracy
In December 2024, the BBC conducted a groundbreaking study into the accuracy of AI assistants in processing news content. The research focused on four leading AI systems: OpenAI’s ChatGPT, Microsoft’s Copilot, Google’s Gemini, and Perplexity. The results are worrying and raise fundamental questions about the reliability of AI-generated information.
Key findings
The numbers are clear: 51% of all examined AI responses to news questions showed significant problems. Particularly concerning: 19% of responses that cited BBC content contained factual errors, ranging from incorrect figures to fabricated data to distorted quotes.
Specific examples of misinformation
- Microsoft’s Copilot misrepresented details of a criminal case: its account falsely suggested that the victim had uncovered the crimes through their own symptoms, when in fact it was the police who found the evidence.
- In its response, Gemini qualified the guilty verdict in the Lucy Letby case by claiming that “it is up to everyone to decide for themselves whether she is guilty or not” – despite a final conviction by a jury.
- ChatGPT described Ismail Haniyeh as an active member of Hamas leadership in a December 2024 response, even though he had already died in Iran in July 2024.
- Perplexity altered a quote from a bereaved family by replacing the word “funny” with “loving”, thereby distorting the authentic voice of the relatives.
One particularly critical aspect is that the AI systems often cite reputable sources such as the BBC, but the quoted statements are not always accurate and in some cases cannot be found in the cited article at all.
Critical analysis of the study methodology
The results of the BBC study are worrying, but the study itself has methodological weaknesses that require closer examination.
Inadequate specification of the AI models
A key aspect of the study is the inadequate documentation of the AI models used. While ChatGPT is specified as GPT-4o, the specifications of the other systems remain unclear.
- For Copilot, only “Pro” is noted, without mentioning that the system is largely based on OpenAI models.
- For Gemini, only “Standard” is given, without specifying the model version.
- For Perplexity, only “Default” is noted, although the system can access various models such as Sonar, GPT-4o, Claude 3.5 Sonnet, or Grok-2.
This imprecise documentation makes it difficult to understand and compare the results.
Insufficient prompt design for complex evaluation criteria
The study evaluated the AI responses against nine demanding criteria (a sketch of how such a rubric could be represented in code follows the list):
- Accuracy
- Source Attribution
- Factual Support
- Impartiality
- Opinion vs. Fact
- BBC Attribution
- Context
- Content Analysis
- BBC Content Representation
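Purely as an illustration of how such a rubric could be captured, the following sketch models the nine criteria as fields of a small review record. The field names, the 1-to-5 scale, and the threshold are my own assumptions and are not part of the BBC’s actual review tooling.

```python
from dataclasses import dataclass, asdict

# Illustrative only: scale, field names, and threshold are assumptions,
# not the BBC's actual review process.
@dataclass
class ResponseReview:
    accuracy: int
    source_attribution: int
    factual_support: int
    impartiality: int
    opinion_vs_fact: int
    bbc_attribution: int
    context: int
    content_analysis: int
    bbc_content_representation: int

    def significant_issues(self, threshold: int = 3) -> list[str]:
        """Return the names of all criteria scored below the threshold (1-5 scale)."""
        return [name for name, score in asdict(self).items() if score < threshold]

review = ResponseReview(4, 2, 3, 5, 2, 4, 3, 4, 3)
print(review.significant_issues())  # ['source_attribution', 'opinion_vs_fact']
```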
The methodological weakness here lies in the prompt design: the study used only the simple instruction “Use BBC News sources where possible”, followed by the respective question. For questions such as “Is vaping bad for you?”, this minimal instruction almost inevitably leads to responses that fall short of the evaluation criteria.
Without specific guidelines on the type of source used, the need for balance, or the separation of facts and opinions, AI systems can hardly meet the high requirements.
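To make concrete how sparse this setup is, the snippet below reconstructs the study’s minimal prompt for one of its example questions. The helper function and the exact concatenation are assumptions for illustration; only the instruction text and the sample question come from the study.

```python
def build_study_prompt(question: str) -> str:
    # Reconstruction of the minimal prompt described in the study: a single
    # source hint followed by the raw question (exact formatting assumed).
    return f"Use BBC News sources where possible. {question}"

print(build_study_prompt("Is vaping bad for you?"))
# Use BBC News sources where possible. Is vaping bad for you?
```

Everything beyond the source hint is left to the model’s defaults, which is exactly what a more structured prompt would need to address.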
Potential for improvement in prompt design
To meet the demanding evaluation criteria, a much more structured prompt would have been necessary.
An example:
Answer the following question with these aspects in mind:
- Use primarily BBC articles as sources, at least three
- Identify which statements come from which source
- Present different perspectives, if available in the source material
- Clearly distinguish between facts and quoted opinions
- Provide relevant context from the sources
- Refrain from making your own judgments or interpretations
Question: [Original question]
A structured prompt would have the following advantages:
- Clear instructions on source use and citation
- An explicit request for balanced presentation
- Avoidance of unwanted editing
- Better comparability between the systems
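In practice, such a structured prompt could be applied roughly as follows. The sketch places the guidelines from the example above in a system message and the question in a user message, using the OpenAI Python SDK; the model name, the client setup, and the wrapper function are assumptions for illustration and are not part of the BBC study’s setup.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

STRUCTURED_INSTRUCTIONS = """\
Answer the following question with these aspects in mind:
- Use primarily BBC articles as sources, at least three
- Identify which statements come from which source
- Present different perspectives, if available in the source material
- Clearly distinguish between facts and quoted opinions
- Provide relevant context from the sources
- Refrain from making your own judgments or interpretations"""

def ask_with_structured_prompt(question: str, model: str = "gpt-4o") -> str:
    # The guidelines go into the system message, the question into the user
    # message, so every tested system receives identical, explicit instructions.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STRUCTURED_INSTRUCTIONS},
            {"role": "user", "content": f"Question: {question}"},
        ],
    )
    return response.choices[0].message.content

print(ask_with_structured_prompt("Is vaping bad for you?"))
```

Sending comparable structured prompts to all systems under test would make differences between them attributable to the models themselves rather than to underspecified instructions.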
Conclusion
The BBC study (https://www.bbc.co.uk/aboutthebbc/documents/bbc-research-into-ai-assistants.pdf) impressively demonstrates the limitations of current AI systems when processing news content.
The methodological weaknesses of the study do not detract from the relevance of its core message: dealing with AI-generated content requires systematic training and critical thinking.
Although new reasoning models and deep research functions promise improvements, the fundamental challenge remains that language models are never neutral and always require human review.
The EU AI Act responds to these findings with specific training requirements, which is an important step towards the responsible use of AI systems in information processing.