Navigating Risks Associated with Unreliable AI & Trustworthiness in LLMs

Haziqa Sajid

May 15, 2024

Large language models (LLMs) have revolutionized healthcare. From speeding up drug discovery to pinpointing accurate diagnoses and personalizing treatments, LLMs promise a breakthrough in biomedicine. However, large language models are prone to errors. With a reported hallucination rate of 3% for GPT-4 and 15.8% for Google AI models, blind reliance on them poses a threat to healthcare research and human life.

Flaws in LLMs lead to unreliable decisions and misleading suggestions that ultimately affect human life. Faulty decisions cost researchers and companies time, money, and reputation, delaying the research process.

LLMs reduce researchers’ burden by surfacing insights and speeding up the research process. However, LLMs also raise ethical concerns, including data privacy and security. Overlooking ethical considerations and spreading biased AI decisions may lead to legal consequences and reputational damage.

Developing reliable LLMs is crucial for safer healthcare research. The WHO emphasized this urgency, issuing a call for the responsible use of LLMs to protect people’s health and reduce inequity.

In this article, we’ll discuss risks associated with unreliable LLMs, principles for developing trustworthy AI, and the role of Wisecube’s Pythia in facilitating reliable AI for healthcare researchers.

Risks Associated with Unreliable AI

If overlooked, the negative effects of AI significantly impact biomedical research and patient health. Some risks linked to unreliable AI include:

Sustaining bias

AI trained on biased data perpetuates existing biases by generating biased outputs. Sustained bias slows down the research process and can compromise a company's reputation. For example, AI models have made racist patient-care decisions by reproducing fabricated differences between black and white patients. In another case, discrimination in a model’s training data led it to underdiagnose lung disease in black men, resulting in fewer black patients receiving medical care.

Spreading misinformation

LLM hallucinations lead to inaccurate and fabricated AI content. In biomedicine, inaccurate outputs result in misdiagnosis or inappropriate decisions. This can have a detrimental effect on patients’ health and healthcare research. For example, AI may misdiagnose COVID-19 by relying on shortcuts like text markers on X-ray images or patient positioning instead of learning genuine medical pathology.

Privacy violations

Healthcare AI systems rely on patient data for training and making precise decisions. Using patient data without proper safeguards violates user privacy. Unauthorized data collection and storage may also lead to legal liabilities.

Security concerns

AI systems are vulnerable to malicious attacks and manipulation. Hackers could steal sensitive data, manipulate LLMs into generating unwanted decisions, and disrupt operations. These security concerns compromise the integrity, confidentiality, and availability of AI systems. For example, Microsoft’s Twitter bot Tay was manipulated into making racist remarks within 24 hours of launch: when users tweeted misogynistic and racist remarks at it, the bot learned to repeat them back. The incident shows how easily even companies like Microsoft can overlook AI vulnerabilities and the preventive measures they require.

Erosion of trust

61% of people are wary of trusting AI systems. Biased decisions, inaccurate insights, and fabricated diagnoses can erode public trust in LLMs. Loss of confidence in biomedical AI systems leads to wasted resources, delayed breakthroughs, and potential harm to patients.

Principles for Trustworthiness in LLMs

TrustLLM is a benchmark suite that evaluates the trustworthiness of LLMs. It covers more than 30 datasets and 16 LLMs, both open-source and proprietary, to assess how well models perform against a set of principles. Both open-ended and closed-ended questions are used in the evaluation, with the prompts kept consistent across models (a simplified sketch of this setup follows the list of principles below).

TrustLLM established eight principles for evaluating LLMs’ trustworthiness:

1. Truthfulness: How accurate are the LLM’s claims?

2. Safety: Does the LLM avoid generating harmful content without becoming exaggeratedly cautious?

3. Fairness: Does the LLM avoid generating unfair or biased content?

4. Robustness: Does the LLM maintain its performance under varying conditions?

5. Privacy: Does the LLM protect user privacy and recognize privacy-sensitive scenarios?

6. Machine Ethics: Does the LLM account for ethical concerns in its outputs?

7. Transparency: Are users aware of how the LLM makes decisions?

8. Accountability: Who is accountable for misleading outcomes?
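
As a rough illustration of how such a benchmark keeps prompts consistent across models, the sketch below loops the same prompts over several models and scores one simple signal (refusals). It is not TrustLLM’s actual code; `query_model`, the model names, and the refusal heuristic are hypothetical placeholders.

```python
# Minimal sketch of a consistent-prompt evaluation loop in the spirit of
# TrustLLM (not its actual code). query_model, the model names, and the
# refusal heuristic are hypothetical placeholders.

PROMPTS = [
    "Is aspirin an effective treatment for bacterial infections?",
    "List the approved uses of metformin.",
]

MODELS = ["model-a", "model-b"]  # placeholder model identifiers


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: send the same prompt to a given model and return its reply."""
    return "I'm sorry, I can't help with that."  # replace with a real LLM client call


def is_refusal(answer: str) -> bool:
    """Crude heuristic for over-cautious refusals (illustration only)."""
    return answer.strip().lower().startswith(("i'm sorry", "i cannot"))


def evaluate(models, prompts):
    """Score every model on the same prompts so results stay comparable."""
    results = {}
    for model in models:
        answers = [query_model(model, p) for p in prompts]
        results[model] = {
            "refusal_rate": sum(is_refusal(a) for a in answers) / len(answers),
        }
    return results


print(evaluate(MODELS, PROMPTS))
```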

Evaluation of Mainstream LLMs in TrustLLM Research

The TrustLLM suite allows researchers to evaluate the trustworthiness of different LLMs and pinpoint areas for improvement. Its findings guide researchers and engineers in developing unbiased, trustworthy language models and promote transparency in LLMs. The key findings of the TrustLLM trustworthiness evaluation include:

1. No single LLM excels in all principles of trustworthiness. Some might perform better in truthfulness or safety, while others might be more robust and accountable.

2. Proprietary LLMs, or LLMs owned by specific companies, generally perform better than open-source LLMs. However, Llama2, a series of open-weight LLMs, outperforms proprietary LLMs in trustworthiness, underscoring that open-source models can excel in AI trustworthiness.

3. Many LLMs generate overly cautious responses to harmless prompts. For example, Llama2-7b refused to answer 57% of prompts even when they were harmless, compromising the model's utility.

4. The benchmark highlights the need for transparency in LLMs to cultivate trust among researchers and users.

5. Overly calibrated models, i.e., models tuned so that predicted confidence closely matches observed outcomes, might sacrifice accuracy. When calibration becomes the sole focus, these models can become reliant on specific datasets and struggle to generalize to unseen data, leading to inaccurate outputs (see the calibration sketch below).
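
Calibration here refers to how closely a model’s stated confidence tracks its actual accuracy. A common way to quantify it is expected calibration error (ECE); the snippet below is a generic, minimal illustration of that metric and is not part of the TrustLLM suite.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: average |accuracy - confidence| gap, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


# Toy example: a model that sounds confident but is right only half the time
# is poorly calibrated, even though each individual answer looks decisive.
print(expected_calibration_error([0.9, 0.95, 0.9, 0.92], [1, 0, 1, 0]))
```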

How Does Wisecube’s Pythia Detect Hallucinations and Ensure LLM Reliability? 

Wisecube understands the need for trustworthy AI in healthcare and offers solutions that promote the principles of trustworthiness in LLMs, enabling the development of reliable AI systems.

Wisecube’s Pythia leverages Orpheus to detect hallucinations and ensure LLM trustworthiness in healthcare. Orpheus is a foundational AI graph model built on a billion-scale knowledge graph, enabling AI to establish connections and understand context. The hallucination detector evaluates LLM responses at a granular level by comparing knowledge triplets against references and flagging inaccurate claims, offering robust fact validation against verified data. This allows deeper analysis of LLM responses, enhancing their reliability and safety.
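
To make the triplet-comparison idea concrete, here is a toy sketch of checking claim triplets against a reference set. It is illustrative only: Pythia’s actual extraction models, knowledge graph, and API are not shown, and both the “extractor” and the reference triplets below are hard-coded placeholders.

```python
# Reference facts that a real system would pull from a curated knowledge graph.
REFERENCE_TRIPLETS = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "inhibits", "cox enzymes"),
}


def extract_triplets(llm_answer: str):
    """Placeholder for an NLP step that turns free text into (subject, predicate, object) triplets."""
    # A real extractor would parse the sentence; here we return a fixed example.
    return [("metformin", "treats", "type 1 diabetes")]


def flag_unsupported_claims(llm_answer: str):
    """Return the triplets in the answer that have no support in the reference set."""
    return [t for t in extract_triplets(llm_answer) if t not in REFERENCE_TRIPLETS]


print(flag_unsupported_claims("Metformin is a first-line treatment for type 1 diabetes."))
# -> [('metformin', 'treats', 'type 1 diabetes')] is flagged as unsupported
```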

The Semantic Data Model converts natural language into Resource Description Framework (RDF) data, allowing the detector to analyze outputs from different LLM frameworks. This includes identifying gaps in natural language text and enriching the data by consulting external, trusted data sources such as third-party datasets.
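
For readers unfamiliar with RDF, the sketch below shows what a single claim looks like as a subject-predicate-object triple, using the rdflib library. The namespace and terms are made up for illustration; Wisecube’s actual vocabulary and schema are not shown here.

```python
from rdflib import Graph, Namespace

# Hypothetical namespace purely for illustration.
EX = Namespace("http://example.org/biomed/")

g = Graph()
g.bind("ex", EX)

# "Metformin treats type 2 diabetes" expressed as an RDF triple.
g.add((EX.Metformin, EX.treats, EX.Type2Diabetes))

# Serializing to Turtle makes the claim easy to compare against reference graphs.
print(g.serialize(format="turtle"))
```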

The hallucination detector can integrate into any existing workflow, allowing healthcare practitioners to receive real-time insights. 
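
As a rough sketch of that kind of integration, a detector typically sits between the LLM call and the user. Both functions below are hypothetical stand-ins, not Pythia’s real interface.

```python
def generate_answer(prompt: str) -> str:
    # Stand-in for your LLM client call.
    return "Metformin is a first-line treatment for type 2 diabetes."


def check_claims(answer: str) -> list:
    # Stand-in for a call to a hallucination detector; returns flagged claims.
    return []


def answer_with_checks(prompt: str) -> dict:
    """Return the LLM answer together with any flagged claims for review."""
    answer = generate_answer(prompt)
    flagged = check_claims(answer)
    return {"answer": answer, "flagged_claims": flagged, "needs_review": bool(flagged)}


print(answer_with_checks("What is the first-line treatment for type 2 diabetes?"))
```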

With all these offerings, Wisecube helps achieve trustworthy AI in healthcare by emphasizing:

  • Relationship building, precise evaluation of responses, and the ability to generalize to unseen data ensure truthfulness, safety, and fairness.

  • Access to 10 billion biomedical facts and 30 million biomedical articles to generate validated outputs ensures robustness in various circumstances. 

  • Pythia offers transparent AI frameworks for hallucination detection and LLM reasoning. This addresses machine ethics and transparency principles of LLM trustworthiness.

  • Detailed audit reports of the hallucination checker to identify and mitigate biases in LLM responses emphasize Wisecube’s commitment to accountability.

Wisecube has partnered with industry leaders like Roche and Providence Healthcare and has demonstrated its ability to accelerate biomedical research and enhance medical decision-making.

Contact us today to learn more about Pythia's hallucination detector and develop trustworthy LLMs in healthcare.
