
AI Hallucinations Are Data Problems


by David Corrigan, Chief Strategy & Marketing Officer

A light is being shone on AI hallucinations lately. Rightly so. These are the outliers in AI behaviour that have a serious negative impact on customer experience. In fact, 70% of consumers say they would leave a brand after just one bad AI experience. One. AI is a high-stakes game. 

First, what is an “AI hallucination”? IBM offers a good definition:

“AI hallucination is a phenomenon wherein a large language model (LLM)—often a generative AI chatbot or computer vision tool—perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.”

So, why do they happen? There are three main reasons: 

  1. Wrong data 
  2. Poor AI model development (bias, overfitting the model to training data, poor calibration of the model’s confidence in its answers, etc.), and 
  3. Poor QA. Insufficient test cases to catch hallucinations, as well as insufficient test cases to induce them. Too often, QA is narrowly focused on testing application functionality. QA needs to evolve into a much broader test matrix for “thinking systems”, because their scope is a lot broader than that of “functional systems”. A sketch of what such a behavioural test case might look like follows this list.
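
To make the broader test matrix concrete, here is a minimal sketch of a behavioural test case in Python. The `ask_agent` function is a hypothetical stand-in for a call to your deployed AI agent, and the prompt, verified facts, and forbidden claims are illustrative assumptions, not a prescribed API:

```python
# A minimal behavioural QA sketch: instead of only asserting that the
# agent responds (functional testing), each case checks the answer
# against verified ground truth and flags claims that would indicate
# a hallucination.
from dataclasses import dataclass


@dataclass
class HallucinationCase:
    prompt: str                 # question posed to the agent
    verified_facts: list[str]   # phrases the answer must be consistent with
    forbidden: list[str]        # claims that would indicate a hallucination


def ask_agent(prompt: str) -> str:
    # Hypothetical stand-in: in real QA this would call your deployed agent.
    return "This customer is on the Gold tier."


def run_case(case: HallucinationCase) -> bool:
    answer = ask_agent(case.prompt).lower()
    grounded = all(fact.lower() in answer for fact in case.verified_facts)
    hallucinated = any(claim.lower() in answer for claim in case.forbidden)
    return grounded and not hallucinated


cases = [
    HallucinationCase(
        prompt="What service tier is this customer on?",
        verified_facts=["gold"],            # from the verified customer record
        forbidden=["platinum", "diamond"],  # tiers this account does not have
    ),
]

failures = [c.prompt for c in cases if not run_case(c)]
print(f"{len(failures)} of {len(cases)} behavioural cases failed")
```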

Now, I didn’t say a lot about point #1. You didn’t think this blog was going to ignore data, did you? Of course not! And I chose my words carefully. “Wrong” data isn’t just bad data. And I think that is the point, and also the problem. There are so many posts and articles that focus on improving data quality to improve AI and reduce hallucinations. That’s true, but it’s only part of the problem. AI needs the right data.

The right data for AI includes:

  1. Right quality data – it needs to be of high enough quality for the AI use case, and it needs to be complete. Data needs to be at one quality level to develop the AI model, and at a different level (aka real-world, or lower, quality) to test that model and complete its development. So this isn’t just a case of “we need perfect data quality.”
  2. Unbiased data – data bias needs to be detected, and removed, from training data sets. 
  3. Factually correct data – some data needs to be verified and validated using external sources, to be sure that it is factually correct. 
  4. Real-world data – data used for training needs to be real-world: not overly cleansed, not perfect quality, and it should include outliers and aberrations. A sketch of checking these criteria against a data set follows this list.
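
As a rough illustration, here is a minimal sketch of checking a training set against these criteria, assuming a pandas DataFrame and a hypothetical `bias_col` attribute to examine for skew. Factual correctness (criterion 3) requires calls to external verification sources, so it is only noted in a comment:

```python
import pandas as pd


def check_training_data(df: pd.DataFrame, bias_col: str,
                        max_skew: float = 0.6) -> dict:
    report = {}

    # 1. Right quality: per-column completeness, not blanket "perfect quality"
    report["completeness"] = (1 - df.isna().mean()).to_dict()

    # 2. Unbiased: flag the set if any one group dominates it
    shares = df[bias_col].value_counts(normalize=True)
    report["max_group_share"] = float(shares.max())
    report["biased"] = bool(shares.max() > max_skew)

    # 3. Factually correct: would require external verification sources,
    #    so it is out of scope for this sketch.

    # 4. Real-world: confirm that outliers survived cleansing
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    report["outlier_rows"] = int((z.abs() > 3).any(axis=1).sum())

    return report
```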

In order to make your data “right” for AI, there are some important data management functions that need to improve drastically. 

  1. Data profiling – data needs to be profiled more deeply – not just reading column headers or lightly sampling large data sets. Fully profiled. It also needs to profile for new things: bias and factual correctness. (See the first sketch after this list.)
  2. Data context – the context for data usage needs to be properly understood and modelled. The context of real-world data. The context of proper training data. The context of QA testing data and what you are trying to test. This needs to be documented and followed. In a lot of ways, it can be thought of as a set of data governance policies, and the same tools might be useful for modelling context, with some evolution.
  3. Data set creation – the process to create data sets must be improved for training data. It builds on the profiling above – to include unbiased and correct data. Different data sets need to be created for AI QA. If your QA data is derived from your training data, then the only thing you are really testing is the functionality of the AI Agent (is it functioning as intended?), not the behaviour of the AI Agent (how does it behave in an uncontrolled environment?). QA data sets need to be ‘real world’. But they also need to induce hallucinations. Try to induce them (in QA!) by including outlier and aberration data. (See the second sketch after this list.)
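
For the profiling point, here is a minimal sketch of deeper profiling in pandas: full-column statistics rather than light sampling, plus a bias profile over a hypothetical `protected_cols` list of attributes. It is an illustration of the idea, not a replacement for a profiling tool:

```python
import pandas as pd


def deep_profile(df: pd.DataFrame, protected_cols: list[str]) -> dict:
    # Profile every row of every column, not a light sample
    profile = {
        "rows": len(df),
        "null_rate": df.isna().mean().to_dict(),
        "distinct_values": df.nunique().to_dict(),
        "numeric_summary": df.select_dtypes("number").describe().to_dict(),
    }
    # Profile for new things: the distribution of each protected
    # attribute, as raw material for bias detection
    profile["bias"] = {
        col: df[col].value_counts(normalize=True).to_dict()
        for col in protected_cols
    }
    return profile
```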
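And for data set creation, a minimal sketch of building a QA set that is disjoint from the training data and deliberately seeded with aberrations to try to induce hallucinations. The split logic and parameters are assumptions for illustration:

```python
import pandas as pd


def build_qa_set(real_world: pd.DataFrame, training: pd.DataFrame,
                 outlier_frac: float = 0.05, seed: int = 42) -> pd.DataFrame:
    # QA data must not be derived from training data, or you only test
    # functionality, not behaviour
    qa = real_world.loc[~real_world.index.isin(training.index)].copy()

    # Seed aberrations: push a sample of numeric values far outside the
    # distribution the model was trained on
    victims = qa.sample(frac=outlier_frac, random_state=seed).index
    for col in qa.select_dtypes("number").columns:
        qa.loc[victims, col] = qa[col].mean() + 10 * qa[col].std()

    return qa
```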

I think data is the most important factor in AI hallucinations. Hallucinations have to be found in QA and prevented from happening in production. The negative impact on customer experience is simply too high. And I believe every single one of those hallucinations can be stamped out with a modern approach to data management.

