Evaluating LLM Hallucination Detectors

Haziqa Sajid

Aug 30, 2024

Discover the limitations of traditional LLM evaluation methods and how Pythia's granular approach provides a more accurate assessment of LLM hallucinations.

Did you know that 67% of organizations use generative AI in their workflows to optimize operations? Yet, leading LLMs hallucinate up to 27% of the time in real scenarios.

LLM hallucinations can lead to serious consequences such as spreading false information, impaired judgment, loss of trust, and perpetuation of bias. These consequences are especially problematic in critical domains like news, healthcare, and legal advice. 

Let’s explore the traditional evaluation methods for measuring hallucination detectors and how Pythia outperforms them.

What Are LLM Hallucinations?

LLM hallucinations are instances where an LLM generates false or unsupported output. These outputs can take the form of inconsistent results, nonsensical text, or fabricated information presented as fact.

In healthcare, LLM hallucinations can lead to deaths and serious disabilities due to misdiagnosis. Legal errors arising from LLM-generated content also have severe consequences, with settlements reaching $250,000. For example, Air Canada's chatbot invented a discount policy, which caused customer dissatisfaction and loss of trust.

Common LLM Application Tasks

LLM configuration depends on the task the LLM application handles. Below are a few of the most common tasks LLM applications perform:

Zero Context Q&A

Zero-context Q&A represents a basic chatbot interaction: the user provides minimal input, often a single question, and receives a short output. The output length can vary with the input, but no external information is involved, so the LLM may generate inaccurate or irrelevant responses due to the limited information. The response can also be biased by improper question framing.

For example, a question like "Why is South Carolina BBQ worse than Texas BBQ?" might lead to an LLM-generated response that assumes the premise is true.

Retrieval-Augmented Generation (RAG) Q&A

RAG improves the accuracy of large language models by using user input to retrieve relevant documents or passages from a pre-existing knowledge base. The retrieved documents are then passed to the LLM to generate an answer. This method aims to provide more grounded and accurate responses by reducing hallucinations.

However, if the retrieved documents are incorrect or irrelevant, the generated answer may also be factually incorrect. 
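To make the flow concrete, here is a minimal RAG sketch in Python. The retrieve and generate helpers are hypothetical placeholders supplied by the caller, not part of any specific library; real pipelines add chunking, ranking, and prompt templating.

    # Minimal RAG sketch. `retrieve` and `generate` are hypothetical callables
    # supplied by the caller; they are not part of any specific library.
    def rag_answer(question, knowledge_base, retrieve, generate, top_k=3):
        # Use the user's input to pull relevant passages from the knowledge base.
        passages = retrieve(question, knowledge_base, top_k=top_k)
        # Ground the LLM's answer in the retrieved passages.
        prompt = (
            "Answer the question using only the context below.\n\n"
            "Context:\n" + "\n".join(passages) +
            "\n\nQuestion: " + question
        )
        return generate(prompt)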

Text Summarization

Text summarizers take a large input, such as a document, and summarize it into a smaller output. Compared to Q&A tasks, text summarizers have a more complex concept of accuracy, especially when summarizing content with known inaccuracies. The summary should accurately reflect the content and intent of the original input. 

For example, when summarizing patient documents, the summary should correctly represent critical details like drug prescriptions.

Pythia Strategies for Measuring and Benchmarking LLMs

Pythia's methodology involves claim extraction and categorization for granular evaluation. Here's how it works:

Claim Extraction

Pythia extracts claims as triplets (<subject, predicate, object>) from text and then evaluates their accuracy. This approach provides more actionable insights than simply evaluating text inputs and outputs with an LLM. You can identify specific hallucinations or errors by labeling these triplets as correct or incorrect. 

For example, a claim can be represented in triplet form as follows:
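As an illustration (the sentence and field names below are hypothetical, not Pythia's actual schema), a single extracted claim might look like this:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Claim:
        subject: str
        predicate: str
        obj: str

    # "Aspirin reduces the risk of heart attack" could be extracted as:
    claim = Claim(subject="aspirin", predicate="reduces the risk of", obj="heart attack")
    print(claim)  # Claim(subject='aspirin', predicate='reduces the risk of', obj='heart attack')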

Claim Categorization

The next step is to classify and categorize the extracted triplets. This involves comparing the response claims with reference claims and categorizing them based on their deviation.

The claims are divided into the following categories:

Entailment

Claims that exist in both the response and the reference are categorized as Entailment. 

Contradiction

Claims in the LLM response that conflict with claims in the reference are categorized as Contradiction. These claims indicate clear errors that need correction.

Missing Claims

Claims present in the reference but missing from the generated output are categorized as missing claims. Missing information may reduce the completeness or accuracy of the generated response. 

Neutral

Claims present in the generated response but absent in the reference are categorized as Neutral. While these claims are technically hallucinations, they aren't necessarily harmful.
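A minimal sketch of this categorization step, assuming claims are already extracted as triplets, that exact matching is sufficient for entailment, and that a hypothetical conflicts_with helper detects contradictions (Pythia's actual matching is more sophisticated):

    def categorize_claims(response_claims, reference_claims, conflicts_with):
        """Bucket response claims against reference claims (illustrative only)."""
        entailment = [c for c in response_claims if c in reference_claims]
        contradiction = [c for c in response_claims
                         if any(conflicts_with(c, r) for r in reference_claims)]
        missing = [r for r in reference_claims if r not in response_claims]
        neutral = [c for c in response_claims
                   if c not in entailment and c not in contradiction]
        return {
            "entailment": entailment,
            "contradiction": contradiction,
            "missing": missing,
            "neutral": neutral,
        }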

Accuracy Metrics

Pythia's accuracy metrics focus on entailment and contradiction to measure hallucinations in LLM outputs. The metric calculates the harmonic mean of the entailment and non-contradiction rates, balancing the evaluation's precision and recall aspects.

Reliability is an optional component of the Pythia accuracy metric. It verifies the neutrals in the evaluation, enhancing the accuracy of the assessment.
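A sketch of this metric, with assumed definitions for the two rates (entailment rate relative to the reference claims, non-contradiction rate relative to the response claims); Pythia's exact formulas may differ:

    def pythia_style_accuracy(n_entailment, n_contradiction, n_neutral, n_missing):
        """Harmonic mean of entailment and non-contradiction rates (sketch)."""
        n_reference = n_entailment + n_missing                    # claims in the reference
        n_response = n_entailment + n_contradiction + n_neutral   # claims in the response
        entailment_rate = n_entailment / n_reference if n_reference else 0.0
        non_contradiction_rate = (
            1 - n_contradiction / n_response if n_response else 1.0
        )
        if entailment_rate + non_contradiction_rate == 0:
            return 0.0
        return (2 * entailment_rate * non_contradiction_rate
                / (entailment_rate + non_contradiction_rate))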

Grading Strategy

Pythia uses a letter-grade system (A through F) to evaluate the generated response. The letter grades are then mapped to a common numerical scale to produce a score. This grading strategy's simplicity allows for easy comparison to more complex evaluation methods.
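For illustration, a letter-to-score mapping might look like the following; the specific numeric values are assumptions, since the source only states that grades A through F are mapped to a common numerical scale:

    GRADE_TO_SCORE = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}  # assumed values

    def grade_to_score(grade: str) -> float:
        return GRADE_TO_SCORE[grade.upper()]

    print(grade_to_score("B"))  # 0.75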

Lynx Evaluation Strategy

Lynx is an open-source LLM hallucination evaluation model that uses a simpler Pass/Fail system for evaluating responses. This binary approach minimizes complexity while maintaining a clear standard for assessing response quality. However, it also limits the ability to capture more nuanced differences in performance. 

Pythia vs. Lynx Evaluation

Because it cannot capture nuanced details, Lynx can detect hallucinations only in RAG-based Q&A. It cannot detect hallucinations in more complex LLM tasks like text summarization.

Pythia's more granular grading strategy, by contrast, can detect hallucinations in all LLM contexts, including zero-context chatbots, RAG Q&A, and text summarization.

The key differences between the Pythia and Lynx strategies:

  • Grading: Pythia assigns letter grades (A through F) mapped to a numerical score; Lynx uses a binary Pass/Fail.

  • Coverage: Pythia evaluates zero-context Q&A, RAG Q&A, and text summarization; Lynx is limited to RAG-based Q&A.

Measuring the Effectiveness of Hallucination Detectors

Measuring how effectively the detection system identifies hallucinations involves evaluating the measurement system itself. It includes the following components:

NLP Benchmarking

Traditional NLP benchmarks often use specialized datasets for tasks such as named entity recognition and coreference resolution. These datasets can also be specialized by domain, such as medical data for hallucination detection in medical LLMs, or by task type.

However, if a dataset does not match the specific domain or task, its relevance and effectiveness for evaluation are limited. Benchmark datasets are usually pre-processed, which can differ significantly from real-world, natural data. Some datasets are outdated and may not reflect current language use or challenges. 

Newer language models require the measurement of more abstract concepts such as factuality, consistency, and faithfulness. Unlike traditional NLP tasks, these metrics are less standardized and more complex to define. 

Dataset Requirements

You must have access to all input and output data when measuring your hallucination detection systems. 

The essential requirements for measuring a system include:

Raw Datasets

The datasets must be raw and include both good and bad examples. This ensures that the measurement system identifies accurate responses and is adept at flagging errors. For example, BioASQ focuses on providing ideal answers but lacks examples of incorrect answers, which limits its usefulness for evaluating error detection.

Human-Generated Labels

Human-generated labels are crucial for measuring the accuracy of the hallucination measurement system. This is because relying on automated labels requires evaluating the accuracy of that labeling system. This leads to an endless loop of measurement without a reliable foundation. 

Access to Suitable Datasets

Only a few datasets currently contain raw data, balanced examples, and human-generated labels. This shortage presents a significant challenge in accurately measuring the performance of language models. 

The scarcity of suitable datasets makes this a challenging task, but ongoing efforts to build better datasets address this need, especially in anticipation of new regulations.

Summarization Datasets

Common NLP benchmarking datasets used in text summarization tasks include:

QAGS-CNNDM

  • Source: This dataset is generated from CNN and Daily Mail news articles.

  • Labeling: Mechanical Turkers label the data.

  • Summary Generation: The summaries are generated through specific methods tailored for CNN and Daily Mail articles.

  • Evaluation Metrics: Each sentence in the summary is labeled "yes" or "no" by the labelers, indicating whether it is factually consistent with the source article.

  • Examples: The dataset contains both good and bad examples.

QAGS-XSUM

  • Source: This dataset is generated from BBC news articles.

  • Labeling: Mechanical Turkers label the data.

  • Summary Generation: Summaries in this dataset are generated through specific methods tailored for BBC articles.

  • Evaluation Metrics: Each sentence in the summary is labeled "yes" or "no" by the labelers, indicating whether it is factually consistent with the source article.

  • Examples: The dataset contains both good and bad examples.

SummEval

  • Source: This dataset is derived from CNN and Daily Mail articles.

  • Labeling: Labeling is done by both experts and Mechanical Turkers. The dataset records these labels separately, allowing for differentiation between expert and Mechanical Turker evaluations.

  • Summary Generation: Summaries are generated through various methods, similar to QAGS-CNNDM.

  • Evaluation Metrics: Consistency, coherence, fluency, and relevance are measured, primarily focusing on consistency in some analyses. The study found little correlation between expert labels and those from Mechanical Turkers, leading to a preference for using expert labels for consistency.

  • Examples: The dataset contains both good and bad examples.

Evaluation Metrics for Summarization

Spearman correlation is the standard metric used in summarization benchmarks to evaluate the performance of various models and generate a leaderboard. However, Spearman correlation ranks each item in the dataset based on its label. For example, if items 10, 11, and 12 have the same label, they all receive the same average rank (e.g., rank 11). Consequently, Spearman correlation doesn't accurately reflect the differences between items when the dataset labels are coarse.

Therefore, a more precise or granular system might perform worse in Spearman correlation-based rankings. As a result, models like Pythia that perform well in more detailed or granular evaluations might be unfairly penalized in Spearman correlation-based rankings due to the lower granularity of the dataset labels.
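The effect of ties is easy to reproduce with made-up numbers. Below, a system that merely echoes the coarse labels gets a perfect Spearman correlation, while a more granular system that orders the items sensibly scores lower:

    from scipy.stats import spearmanr

    human_labels = [0, 0, 0, 1, 1, 1]                     # coarse, heavily tied labels
    granular     = [0.05, 0.10, 0.20, 0.70, 0.80, 0.95]   # fine-grained system scores
    coarse_echo  = [0, 0, 0, 1, 1, 1]                     # a system that just repeats the labels

    corr_granular, _ = spearmanr(human_labels, granular)
    corr_coarse, _ = spearmanr(human_labels, coarse_echo)
    print(corr_granular)  # ~0.88
    print(corr_coarse)    # 1.0, despite adding no information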

MAE as an Alternate Metric

Mean Absolute Error (MAE) addresses these issues with Spearman correlation. Unlike Spearman correlation, MAE measures the average error between predicted values and human labels and is unaffected by ties. This approach avoids penalizing models that provide more granular outputs, making it a more balanced evaluation metric for systems like Pythia.

The Pythia system shows much better performance when evaluated with MAE. This confirms that MAE is a better alternative to Spearman correlation and validates Pythia's more granular strategy, which leads to more accurate results.
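A minimal MAE sketch; errors are measured per item against the human labels rather than through ranks, so tied labels no longer dominate the comparison (the values below are made up):

    def mean_absolute_error(predicted, labels):
        return sum(abs(p - y) for p, y in zip(predicted, labels)) / len(labels)

    print(mean_absolute_error([0.9, 0.2, 0.7], [1.0, 0.0, 0.5]))  # ~0.167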

RAG Q&A Datasets

Common datasets used to evaluate the accuracy and reliability of RAGs include:

  • DROP: A reading comprehension dataset created with Wikipedia pages.

  • FinanceBench: A dataset designed for benchmarking financial NLP tasks, focusing on financial texts.

  • RAGTruth: A dataset for evaluating Retrieval-Augmented Generation (RAG) models using data from MS MARCO.

  • CovidQA: A question-answering dataset related to COVID-19 derived from scientific articles.

  • HaluEval: A dataset designed to evaluate hallucination in generated text.

  • PubMed: A biomedical literature dataset often used for tasks like summarization or information extraction.

RAG Q&A Metrics

Typical evaluation metrics used for RAG Q&A include:

Mean Absolute Error (MAE)

In binary settings, MAE measures the absolute difference between predictions and actual values. However, it may not effectively capture performance, especially with highly granular data where values are concentrated in extremes. 

Receiver Operating Characteristic (ROC) and Precision-Recall Curve

ROC and precision-recall curves provide insights into classification performance. These curves help evaluate how well a model differentiates between classes. However, due to the detailed nature of these metrics, they may not be suitable for granular grading strategies like Pythia's or binary strategies like Lynx's.
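Assuming binary hallucination labels and continuous detector scores (both made up below), the curves can be computed with scikit-learn:

    from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

    y_true  = [0, 0, 1, 1, 0, 1]                 # 1 = hallucinated (made-up labels)
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # detector scores (made up)

    fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))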

Accuracy

Accuracy is a more suitable metric for binary classification and datasets with varied granularity. It measures the proportion of correct predictions out of all predictions. However, choosing an appropriate threshold is essential when applying it to Pythia's graded scores.
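A small sketch of threshold-based accuracy for graded scores; the 0.5 threshold is an assumption and would in practice be tuned to the grading scale:

    def binary_accuracy(scores, labels, threshold=0.5):
        """Threshold graded scores into binary predictions, then score them."""
        predictions = [1 if s >= threshold else 0 for s in scores]
        correct = sum(p == y for p, y in zip(predictions, labels))
        return correct / len(labels)

    print(binary_accuracy([0.9, 0.3, 0.6, 0.1], [1, 0, 1, 0]))  # 1.0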

Last Words

Running a dataset through a benchmarking system is no longer sufficient for evaluating NLP models. It is crucial to understand how the benchmarking data aligns with your system's inputs and outputs. For example, if the dataset measures summaries differently from your system, the results might not be informative. Metrics should also be selected based on the dataset and task.

Learn more about the limitations of traditional LLM benchmarking and Pythia's effective solution in the webinar here.

Future Work

We are continuously exploring new evaluation methods. A few of them include:

  1. Testing Strategies

  • Pythia V2

  • RAGAS

  2. Testing Models

  • Larger vs. smaller models

  • General vs. fine-tuned models

  3. Calibration

  4. Ensemble methods

FAQs

How to compare different systems on your own data?

Evaluate systems on your own data after comparing them to open datasets. Examine the distribution of scores and outputs for your data and compare them to the open dataset results. If your dataset shows models predicting significantly lower scores compared to the open dataset, this might indicate a problem with the model’s performance on your data.

How to figure out what metric to use?

It's crucial to comprehend how your measurement system works and how to apply it effectively. This includes understanding the data and metrics you use and ensuring they align with your expectations for both input and output. Metrics should be adapted based on the specific characteristics of your dataset. For example, if using the Pythia system, the accuracy metric based on entailment and contradiction is useful.

How to measure an LLM without much data?

You can use open datasets to evaluate your system. However, instead of relying solely on these datasets to measure the system, you can run your own outputs through your Q&A system. Compare the measurement distributions of your outputs to those from the dataset. For example, if your system yields significantly higher scores than the balanced scores from the dataset, it suggests your system may perform better. Conversely, if your scores are much lower, it could indicate potential issues with your system.
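One rough way to run that comparison is to summarize the two score distributions side by side; the numbers and variable names below are hypothetical:

    from statistics import mean, median

    def summarize(scores):
        return {"mean": mean(scores), "median": median(scores),
                "min": min(scores), "max": max(scores)}

    open_dataset_scores = [0.6, 0.7, 0.8, 0.5, 0.9]    # scores on an open dataset
    your_output_scores  = [0.2, 0.3, 0.4, 0.25, 0.35]  # scores on your own outputs

    print(summarize(open_dataset_scores))
    print(summarize(your_output_scores))   # much lower -> investigate your system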

Is Pythia limited to a specific industry?

Pythia has been used in various fields, including clinical, generic, and financial domains. 

Do you offer a free trial for Pythia?

Yes, we offer a free trial for Pythia. Sign up for Pythia to start your free trial.
