
How Pythia Improved LLM Reliability for a Pharma Company

Haziqa Sajid

Jul 23, 2024

Explore how Pythia helped a client achieve an accuracy of 98.8% using a billion-scale knowledge graph, knowledge triplets, and its distinct methodology.

Natural Language Processing (NLP) in healthcare has multiple useful applications. These include clinical trial optimization, information extraction and knowledge discovery, drug development, and data analysis, including electronic health records (EHRs) and clinical notes. 

A leading pharmaceutical company reached out to Wisecube to detect high-risk hallucinations in their LLM and ensure factual accuracy. Wisecube addressed their challenges with Pythia’s distinct approach to detecting real-time hallucinations in LLMs.

Here’s how we helped our healthcare client achieve an accuracy of 98.8% in their LLM.

The Challenge: Unreliable Patient Summaries

The pharma client struggled with multiple challenges, from an inability to handle rapidly growing biomedical data to the LLM's misinterpretation of references. Here's a list of the challenges they faced:

Manual Reliability Check

The pharma company had to assess LLM reliability manually, using techniques such as fact-checking, consistency checks, and bias detection, but manual assessment wasn't scalable. The company needed a solution that monitors LLM hallucinations in real time without human involvement.

Data Explosion

The rapid and exponential growth of biomedical literature made it difficult to manage and extract meaningful insights from the vast volume of information. The client wanted us to extract facts from a large corpus of data that they could use to assess LLM accuracy.

Data Fragmentation

Biomedical data was scattered across fragmented, small-scale datasets, hindering the synthesis of comprehensive insights and the identification of groundbreaking discoveries. They sought a solution that could combine all datasets for thorough analysis and interconnect biomedical facts within a cohesive network.

Data Access Barrier

A significant portion of biomedical research findings was inaccessible for replication due to limited data availability, hindering the validation and advancement of discoveries. The client wanted a comprehensive data integration from multiple sources, overcoming silos and providing a more complete and accessible dataset for research and development purposes.

Neutral Claims 

The LLM generated neutral claims and made assumptions based on incomplete or missing data, potentially introducing biases or inaccuracies into the insights. Neutral claims can mislead healthcare professionals if not carefully reviewed and contextualized. Therefore, they wanted a solution capable of minimizing the biases and assumptions.

Misinterpretation of Missing Data 

The LLM occasionally misinterpreted missing history sections as a lack of relevant information, leading to inaccuracies in data interpretation. The client needed a solution that recognized that the absence of certain data in patient records does not necessarily indicate the absence of a condition or history, and that flagged missing information as a gap rather than an inferred absence.

Additionally, a continuous monitoring system that flags instances where the AI might misinterpret missing data could improve accuracy and reliability in AI-generated insights. 
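The idea of flagging absent sections as explicit gaps, rather than letting the model infer "no history", can be sketched as follows. This is a minimal illustration under assumptions; the section names, marker string, and function are hypothetical, not part of Pythia's actual system.

```python
# Hedged sketch: annotate missing patient-record sections as explicit gaps
# so a downstream LLM cannot mistake absent data for an absent condition.
REQUIRED_SECTIONS = ["medical_history", "medications", "allergies"]

def flag_missing_sections(record: dict) -> dict:
    """Return a copy of the record with empty/missing sections marked as gaps."""
    annotated = dict(record)
    gaps = []
    for section in REQUIRED_SECTIONS:
        if not record.get(section):  # missing key or empty value
            annotated[section] = "DATA UNAVAILABLE - do not infer absence"
            gaps.append(section)
    annotated["_gaps"] = gaps  # explicit list for monitoring/review
    return annotated

rec = {"medical_history": "", "medications": ["metformin"]}
print(flag_missing_sections(rec)["_gaps"])  # ['medical_history', 'allergies']
```

A monitoring system could then alert whenever a generated summary makes a claim about a section listed in `_gaps`.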

How Pythia Addressed the Challenges

The healthcare client provided 20 docx files containing reference texts and 20 JSON files with summary and meta information. Pythia used this data and Wisecube’s billion-scale knowledge graph to enhance the accuracy and reliability of AI-generated insights.

Billion-scale Knowledge Graph

Wisecube constructed a robust knowledge graph to unify and interlink disparate biomedical facts within a comprehensive graph, using existing ontologies to enrich its context and relevance. The resulting graph framework contained over 5 billion semantic facts with 240 million citations and 100 million ontological relationships. 

Semantic Discovery Platform

Pythia tackled the issue of data explosion by using its Semantic Discovery Platform to manage and synthesize large volumes of biomedical literature. By integrating different data sources into a clear, structured format, Pythia simplified navigation and made it easier to extract valuable insights from the expanding pool of information.

Knowledge Triplets

Extracting claims in the form of knowledge triplets allowed Pythia to represent complex biomedical information in a clear and organized manner. The triplets also allowed Pythia to contextualize information by clearly defining relationships between entities. This helped generate accurate and relevant insights based on the interconnected data. 
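Representing a claim as a subject-predicate-object triplet can be sketched as below. This is an illustrative example only; the claim format, class, and function names are assumptions, not Pythia's actual API.

```python
# Hedged sketch: representing extracted biomedical claims as
# (subject, predicate, object) knowledge triplets.
from typing import NamedTuple

class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: str

def extract_triplets(claims: list[str]) -> list[Triplet]:
    """Turn simple 'subject|predicate|object' claim strings into triplets."""
    return [Triplet(*claim.split("|")) for claim in claims]

claims = [
    "metformin|treats|type 2 diabetes",
    "patient_123|diagnosed_with|hypertension",
]
for t in extract_triplets(claims):
    print(f"({t.subject}) -[{t.predicate}]-> ({t.obj})")
```

Because each triplet names its two entities and the relationship between them, triplets extracted from an LLM response can be checked directly against edges in a knowledge graph.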

Data Accessibility

Pythia improved data accessibility by organizing pertinent information from the client's reference documents. This made it readily accessible for researchers and healthcare professionals. Pythia also provided a user-friendly interface for accessing and exploring LLM performance, making it easier for stakeholders to track Pythia’s outcomes.

Continuous Monitoring

Continuous monitoring of LLM responses eliminated the need for manual reliability assessment, allowing the technical teams to focus on LLM improvements rather than repetitive fact-checking. By continuously monitoring responses and accurately reflecting their reliability, Pythia also improved the interpretation of medical histories: avoiding misinterpretation of missing histories ensured that the data presented was accurate and meaningful.

General Medical Knowledge vs. Specific Patient Data

Pythia improved the relevance of insights by focusing on patient-specific data rather than general medical knowledge that might not be directly applicable. With this approach, Pythia ensured that the insights and recommendations were based on actual patient records instead of generalized information.

Measuring Systematic Checker Error (SCErr)

Pythia uses Systematic Checker Error (SCErr), computed as the harmonic mean of the Entailment and Contradiction rates, to quantify the reliability of LLM responses. The ideal result is 100% Entailment and 0% Contradiction, which drives SCErr toward zero.
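Taking the description above at face value, the metric can be sketched as a plain harmonic mean of the two rates. This is an assumed reconstruction from the text, not Pythia's published formula.

```python
def scerr(entailment: float, contradiction: float) -> float:
    """Harmonic mean of the entailment and contradiction rates (each in 0..1).

    With the ideal 100% entailment and 0% contradiction, the harmonic
    mean collapses to 0, so lower SCErr indicates higher reliability.
    """
    if entailment + contradiction == 0:
        return 0.0  # avoid division by zero for the degenerate case
    return 2 * entailment * contradiction / (entailment + contradiction)

print(scerr(1.0, 0.0))    # ideal case -> 0.0
print(scerr(0.95, 0.05))  # small contradiction rate -> 0.095
```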

Results: 98.8% LLM Accuracy

With an overall accuracy of 98.8%, Pythia effectively integrated the client’s reference documents into a cohesive Knowledge Graph, providing contextually relevant and accurate information. 

The use of systematic checker error metrics highlighted Pythia’s ability to align closely with the provided data while addressing inconsistencies and contradictions. Despite occasional neutral claims and some misinterpretations, Pythia’s approach notably reduced errors and biases. It ensured that the insights derived were both precise and actionable. 

Pythia significantly reduced neutral claims, reflecting continuous improvement of the LLM through real-time identification of hallucinations. Here's the metric breakdown per patient:

These results represent Pythia's effectiveness in improving the reliability of LLMs in critical domains like healthcare. 

The remaining neutral claims highlight the need for careful review and contextual consideration. Another recommendation is to avoid outputting empty medical history sections. Real-time monitoring of LLM responses with Pythia surfaces hallucinations as they occur, allowing LLM developers to take corrective action promptly. This results in continuous improvement of the AI and a further reduction in neutral claims.

Contact us today to detect real-time hallucinations in your LLMs and provide reliable AI interaction to your customers.
