
AI Hallucinations Are Data Problems


by David Corrigan, Chief Strategy & Marketing Officer

A light is being shone on AI hallucinations lately. Rightly so. These are the outliers in AI behaviour that have a serious negative impact on customer experience. In fact, 70% of consumers say they would leave a brand after just one bad AI experience. One. AI is a high-stakes game. 

First, what is an “AI hallucination”? IBM offers a good definition:

“AI hallucination is a phenomenon wherein a large language model (LLM)—often a generative AI chatbot or computer vision tool—perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.”

So, why do they happen? There are three main reasons: 

  1. Wrong data 
  2. Poor AI model development (bias, overfitting the model to training data, poor calibration of the model’s confidence in its answers, etc.), and 
  3. Poor QA. Insufficient test cases to catch hallucinations, as well as insufficient test cases to induce them. Too often, QA is narrowly focused on testing application functionality. QA needs to evolve into a much broader test matrix for “thinking systems”, because their scope is a lot broader than that of “functional systems”. A sketch of what such a behavioural test case might look like follows this list.
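
To make the broader test matrix concrete, here is a minimal sketch of a behavioural test case in Python. The `ask_agent` function is a hypothetical stand-in for a call to your deployed AI agent, and the prompt, verified facts, and forbidden claims are illustrative assumptions, not a prescribed API:

```python
# A minimal behavioural QA sketch: instead of only asserting that the
# agent responds (functional testing), each case checks the answer
# against verified ground truth and flags claims that would indicate
# a hallucination.
from dataclasses import dataclass


@dataclass
class HallucinationCase:
    prompt: str                 # question posed to the agent
    verified_facts: list[str]   # phrases the answer must be consistent with
    forbidden: list[str]        # claims that would indicate a hallucination


def ask_agent(prompt: str) -> str:
    # Hypothetical stand-in: in real QA this would call your deployed agent.
    return "This customer is on the Gold tier."


def run_case(case: HallucinationCase) -> bool:
    answer = ask_agent(case.prompt).lower()
    grounded = all(fact.lower() in answer for fact in case.verified_facts)
    hallucinated = any(claim.lower() in answer for claim in case.forbidden)
    return grounded and not hallucinated


cases = [
    HallucinationCase(
        prompt="What service tier is this customer on?",
        verified_facts=["gold"],            # from the verified customer record
        forbidden=["platinum", "diamond"],  # tiers this account does not have
    ),
]

failures = [c.prompt for c in cases if not run_case(c)]
print(f"{len(failures)} of {len(cases)} behavioural cases failed")
```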

Now, I didn’t say a lot about point #1. You didn’t think this blog was going to ignore data, did you? Of course not! And I chose my words carefully. “Wrong” data isn’t just bad data. And I think that is the point, and also the problem. There are so many posts and articles that focus on improving data quality to improve AI and reduce hallucinations. That’s true, but it’s only part of the problem. AI needs the right data.

The right data for AI includes:

  1. Right quality data – it needs to be of high enough quality for the AI use case, and it needs to be complete. Data needs to be at one quality level to develop the AI model, and at a different level (aka real-world, or lower, quality) to test that model and complete its development. So this isn’t just a case of “we need perfect data quality.”
  2. Unbiased data – data bias needs to be detected, and removed, from training data sets. 
  3. Factually correct data – some data needs to be verified and validated using external sources, to be sure that it is factually correct. 
  4. Real-world data – data used for training needs to be real-world: not overly cleansed, not perfect quality, and it should include outliers and aberrations. A sketch of checking these criteria against a data set follows this list.
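
As a rough illustration, here is a minimal sketch of checking a training set against these criteria, assuming a pandas DataFrame and a hypothetical `bias_col` attribute to examine for skew. Factual correctness (criterion 3) requires calls to external verification sources, so it is only noted in a comment:

```python
import pandas as pd


def check_training_data(df: pd.DataFrame, bias_col: str,
                        max_skew: float = 0.6) -> dict:
    report = {}

    # 1. Right quality: per-column completeness, not blanket "perfect quality"
    report["completeness"] = (1 - df.isna().mean()).to_dict()

    # 2. Unbiased: flag the set if any one group dominates it
    shares = df[bias_col].value_counts(normalize=True)
    report["max_group_share"] = float(shares.max())
    report["biased"] = bool(shares.max() > max_skew)

    # 3. Factually correct: would require external verification sources,
    #    so it is out of scope for this sketch.

    # 4. Real-world: confirm that outliers survived cleansing
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    report["outlier_rows"] = int((z.abs() > 3).any(axis=1).sum())

    return report
```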

In order to make your data “right” for AI, there are some important data management functions that need to improve drastically. 

  1. Data profiling – data needs to be profiled more deeply – not just reading column headers or lightly sampling large data sets. Fully profiled. It also needs to profile for new things: bias and factual correctness. (See the first sketch after this list.)
  2. Data context – the context for data usage needs to be properly understood and modelled. The context of real-world data. The context of proper training data. The context of QA testing data and what you are trying to test. This needs to be documented and followed. In a lot of ways, it can be thought of as a set of data governance policies, and the same tools might be useful for modelling context, with some evolution.
  3. Data set creation – the process to create data sets must be improved for training data. It builds on the profiling above – to include unbiased and correct data. Different data sets need to be created for AI QA. If your QA data is derived from your training data, then the only thing you are really testing is the functionality of the AI Agent (is it functioning as intended?), not the behaviour of the AI Agent (how does it behave in an uncontrolled environment?). QA data sets need to be ‘real world’. But they also need to induce hallucinations. Try to induce them (in QA!) by including outlier and aberration data. (See the second sketch after this list.)
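
For the profiling point, here is a minimal sketch of deeper profiling in pandas: full-column statistics rather than light sampling, plus a bias profile over a hypothetical `protected_cols` list of attributes. It is an illustration of the idea, not a replacement for a profiling tool:

```python
import pandas as pd


def deep_profile(df: pd.DataFrame, protected_cols: list[str]) -> dict:
    # Profile every row of every column, not a light sample
    profile = {
        "rows": len(df),
        "null_rate": df.isna().mean().to_dict(),
        "distinct_values": df.nunique().to_dict(),
        "numeric_summary": df.select_dtypes("number").describe().to_dict(),
    }
    # Profile for new things: the distribution of each protected
    # attribute, as raw material for bias detection
    profile["bias"] = {
        col: df[col].value_counts(normalize=True).to_dict()
        for col in protected_cols
    }
    return profile
```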
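And for data set creation, a minimal sketch of building a QA set that is disjoint from the training data and deliberately seeded with aberrations to try to induce hallucinations. The split logic and parameters are assumptions for illustration:

```python
import pandas as pd


def build_qa_set(real_world: pd.DataFrame, training: pd.DataFrame,
                 outlier_frac: float = 0.05, seed: int = 42) -> pd.DataFrame:
    # QA data must not be derived from training data, or you only test
    # functionality, not behaviour
    qa = real_world.loc[~real_world.index.isin(training.index)].copy()

    # Seed aberrations: push a sample of numeric values far outside the
    # distribution the model was trained on
    victims = qa.sample(frac=outlier_frac, random_state=seed).index
    for col in qa.select_dtypes("number").columns:
        qa.loc[victims, col] = qa[col].mean() + 10 * qa[col].std()

    return qa
```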

I think data is the most important factor in AI hallucinations. Hallucinations have to be found in QA and prevented from happening in production. The negative impact on customer experience is simply too high. And I believe every single one of those hallucinations can be stamped out with a modern approach to data management.

