Haziqa Sajid

Episode 006: A History of NLP and Wisecube’s Journey

Jun 6, 2025

TL;DR: Wisecube’s evolution traces the arc of modern NLP, from early search engines and statistical models to knowledge graphs and LLM-powered interfaces. In this episode, Vishnu and Alex reflect on the tools, trade-offs, and breakthroughs that shaped that path, and why foundational models and graph-aware LLMs may hold the key to what’s next.

Getting relevant text out of a corpus used to mean counting term frequencies and praying your TF-IDF ranker guessed the user’s intent. A decade and a half later, we’re extracting evidence-backed triples with GPT-class models and wiring them straight into biomedical knowledge graphs that can answer multi-hop questions. The progress is real, but so are the new challenges: hallucination, evaluation, and ever-growing data complexity.

In the latest episode of Gradient Descent, Vishnu and Alex mark the acquisition by tracing the evolution of NLP through their own work. They reflect on the early challenges of balancing recall and precision in keyword search, their shift to statistical models in healthcare, and the rise of biomedical knowledge graphs that outperformed deep learning baselines.

Today, large language models are serving as both the interface and evaluator for knowledge systems. This episode is a reflection on that journey and a preview of what comes next.

A New Chapter Begins

The episode begins with light small talk, but quickly shifts to a major update. After more than eight years, Wisecube has been acquired by John Snow Labs. Vishnu calls it “the exit of Wisecube,” but frames it as a continuation of their work, not a departure from it.

Rather than diving into future plans, the hosts choose to reflect. This episode traces their journey through fifteen years of applied NLP: starting with search engines, moving through medical classification and knowledge graphs, and ending with large language models. It’s a technical recap shaped by their own experiences building real systems.

Alex jokes about the idea of doing a recap so early in the podcast’s life, referencing a flashback episode from Community. But the format fits. With the acquisition as a milestone, this is a moment to step back and walk through how their thinking and tools evolved.

Search Engines and Expert Systems (2008–2013)

Wisecube’s early work centered on enterprise search. The team used Lucene and TF-IDF to build ranking systems that were fast and reliable. However, these systems had a shallow grasp of meaning: they matched exact keywords rather than the context or intent behind a query.

To make search more relevant, they introduced domain-specific methods:

  • Query expansion with related terms. In medical use cases, a search for “heart attack” would also return results for “myocardial infarction.” This improved recall by recognizing clinically similar terms.

  • Custom indexes built for each domain. Instead of relying on a single general-purpose index, they created specialized corpora with domain-relevant language.
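Query expansion of this kind can be sketched as a pre-processing step that rewrites the query before it hits the index. The synonym map below is a toy stand-in for a real clinical vocabulary; the term pairs are illustrative, not Wisecube’s actual data:

```python
# Hypothetical sketch of query expansion with a hand-built synonym map.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "MI"],
    "high blood pressure": ["hypertension"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus clinically equivalent phrasings."""
    expanded = [query]
    lowered = query.lower()
    for term, equivalents in SYNONYMS.items():
        if term in lowered:
            for eq in equivalents:
                expanded.append(lowered.replace(term, eq))
    return expanded

print(expand_query("heart attack treatment"))
# → ['heart attack treatment', 'myocardial infarction treatment', 'MI treatment']
```

In practice each variant would be OR-ed into the final Lucene query, which is exactly how this trades precision for recall.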

But better recall brought trade-offs. Precision dropped, and it was hard to tell which results were actually useful. The team also faced a cold-start problem. Without user interaction data, there was no feedback loop to help the ranking model improve. TF-IDF could only go so far.

Their first large-scale test came during a medical cohort retrieval challenge. The task involved matching trial descriptions to a database of anonymized patient records. General-purpose search methods didn’t work well, but filtering down to medical terms helped. That simple change made the system much more effective. 

Even when accuracy was average, users consistently accepted the system’s top suggestions. That gap between statistical performance and user trust hinted at a deeper pattern: plausible output being taken at face value, an early version of what the field would later call hallucination.

Pivot to Healthcare and Statistical NLP (2013–2017)

After years of building general-purpose search systems, the team made a deliberate shift. Instead of trying to build models that worked across every domain, they narrowed their focus to one: healthcare. As Vishnu explains,

“One of the ways the company … solved some of these cold start problems is we started pivoting into a specific vertical. In this case, it was healthcare.”

The move solved a key limitation in their earlier work. Without enough user feedback, their general systems struggled to improve over time. Focusing on healthcare allowed them to build datasets, tune models, and ship tools that worked out of the box. And because clinical language is highly structured, it gave them a strong foundation for more advanced techniques.

The first major shift came through vocabulary control. General-purpose systems were overwhelmed by irrelevant tokens. Precision improved as soon as the team filtered those out. As Alex shares,

“We built something that essentially filtered down the vocabulary to just the medical vocabulary … by filtering everything down pretty aggressively to just medical terms, it helped us do well.” 
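As a rough illustration of that filtering step, here is a minimal sketch that keeps only tokens found in a medical term list. The tiny vocabulary is an illustrative stand-in for a real resource like UMLS:

```python
# Toy version of aggressive vocabulary filtering: keep only tokens
# that appear in a (here, tiny and illustrative) medical term list.
MEDICAL_VOCAB = {"myocardial", "infarction", "hypertension", "diabetes"}

def filter_tokens(text: str) -> list[str]:
    """Drop every token that is not a known medical term."""
    return [tok for tok in text.lower().split() if tok in MEDICAL_VOCAB]

note = "Patient denies hypertension but reports prior myocardial infarction"
print(filter_tokens(note))
# → ['hypertension', 'myocardial', 'infarction']
```

Non-medical words like “patient” and “denies” are discarded before ranking, which is what lifted precision.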

That insight carried directly into their next project: computer-aided coding. The task was to classify physician notes using ICD billing codes, which had grown in complexity and volume. Rule-based approaches no longer scaled. With thousands of possible codes per note, multi-label classification became the only viable path.

The shift from bag-of-words to supervised learning created a noticeable jump in performance. Statistical models trained on real clinical data began to outperform hand-tuned heuristics. What started as a search problem became a classification pipeline with a clear target and measurable outcomes.
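The multi-label setup can be sketched with scikit-learn’s one-vs-rest wrapper; the notes and ICD codes below are invented for illustration, not real clinical data:

```python
# Toy sketch of multi-label ICD coding as one-vs-rest classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

notes = [
    "patient presents with chest pain and shortness of breath",
    "type 2 diabetes, poorly controlled, elevated a1c",
    "chest pain radiating to left arm, diabetic history",
]
codes = [["I20"], ["E11"], ["I20", "E11"]]  # each note can carry several codes

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(codes)          # one binary column per ICD code

vec = TfidfVectorizer()
X = vec.fit_transform(notes)

clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
pred = clf.predict(vec.transform(["chest pain on exertion"]))
print(mlb.inverse_transform(pred))
```

Each code becomes its own binary classifier, which is what makes thousands of possible codes per note tractable.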

Going vertical enabled the team to embed domain knowledge like medical vocabularies, billing standards, and even code hierarchies directly into their models. That grounding turned a generic search engine into a reliable tool for real-world clinical use.

Deep Learning and Drug Discovery (2017–2020)

As deep learning advanced, Wisecube’s work moved from document classification to scientific prediction. The team began applying NLP techniques to chemistry, where the goal wasn’t just to analyze language but to model biological activity.

BERT marked a turning point. Unlike earlier models based on word counts, it captured structure and context, making language representations more useful. As Alex puts it,

“Before 2015 it was all bag-of-words … once you get to the point where you have BERT, you can actually capture the structural information … really sort of like this jump in some of these capabilities.” 

That same shift in representation carried over to molecules. Using SMILES strings (a text-based format for chemical structures), the team trained RNNs to predict how compounds might behave. These in-silico models helped prioritize which candidates to test in the lab.
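A character-level RNN over SMILES needs the strings turned into fixed-length integer sequences first. A minimal sketch of that encoding step, with aspirin and caffeine as illustrative molecules:

```python
# Minimal sketch: turning SMILES strings into integer sequences, the
# usual first step before feeding them to a character-level RNN.
smiles = ["CC(=O)Oc1ccccc1C(=O)O", "Cn1cnc2c1c(=O)n(C)c(=O)n2C"]

# build a character vocabulary; index 0 is reserved for padding
chars = sorted(set("".join(smiles)))
stoi = {c: i + 1 for i, c in enumerate(chars)}

def encode(s: str, max_len: int = 40) -> list[int]:
    """Map each character to an id and pad to a fixed length."""
    ids = [stoi[c] for c in s]
    return ids + [0] * (max_len - len(ids))

batch = [encode(s) for s in smiles]
print(len(batch), len(batch[0]))  # 2 sequences, each padded to length 40
```

From here, an embedding layer plus a recurrent network can regress or classify biological activity, which is the QSAR framing the episode describes.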

One of their first pilots applied this approach to tuberculosis. Partnering with IDRI, they used deep-learning-based QSAR models to identify promising drug candidates against resistant TB strains. As Vishnu shares,

“We worked with IDRI … They were researching tuberculosis … trying to come up with new formulations of potential drug candidates … In the discovery world it’s known as QSAR …”

This marked a clear evolution. The team was no longer just building NLP pipelines. They were applying those techniques to drug discovery, using machine learning to guide real-world experiments.

Biomedical Graphs and the Foundation Phase (2020–2022)

By 2020, Wisecube moved from solving individual tasks to building infrastructure that could support many. The team began developing a large biomedical knowledge graph using public sources like PubMed and Wikidata. The new goal was to go beyond linking data and learn from structure at scale.

Instead of crafting features for each use case, they trained global embeddings across the entire graph. These representations turned out to be stronger than expected. When tested against COVID-related QSAR data, the general graph outperformed models trained specifically for that task. As Alex discussed,

“Our very big, very general … knowledge graph … performed better at predicting relationships than the model trained on that data.” 

They applied the same embeddings to biomarker prediction, showing that the model could transfer to genotype-therapy links without rebuilding anything.

“We did the same thing for biomarker prediction … using this knowledge graph … we can actually predict these biomarkers.” 

This foundation-plus-fine-tune approach echoed the emerging architecture of large language models. As Vishnu noted,

“We had this global graph, and then we trained a local model on a specific task … very similar to LLMs — a foundational model, and then supervised fine-tuning for each task.”

Scale became the differentiator. The larger the graph, the stronger its predictions. This phase marked a shift from one-off models to systems designed to generalize, where one graph could support a wide range of applications.
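The episode doesn’t name the embedding method, but a TransE-style model is a common choice for this kind of link prediction: a triple (head, relation, tail) scores well when head + relation lands near tail. A toy sketch with made-up biomedical entities and untrained random embeddings:

```python
# Hedged sketch of TransE-style link prediction over a knowledge graph.
# Entities, relations, and embeddings here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
entities = {"aspirin": 0, "inflammation": 1, "COX-2": 2}
relations = {"treats": 0, "inhibits": 1}

dim = 16
E = rng.normal(size=(len(entities), dim))   # entity embeddings
R = rng.normal(size=(len(relations), dim))  # relation embeddings

def score(head: str, rel: str, tail: str) -> float:
    """Lower is better: distance between head + relation and tail."""
    h, r, t = E[entities[head]], R[relations[rel]], E[entities[tail]]
    return float(np.linalg.norm(h + r - t))

# rank candidate tails for a (head, relation) query
query = [("aspirin", "treats", tail) for tail in entities]
print(sorted(query, key=lambda q: score(*q)))
```

Training pushes true triples toward low distances; once trained on the whole graph, the same embeddings can be reused for new tasks like biomarker prediction without rebuilding anything.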

Modern Era: LLMs as Graph Interfaces (2022–Present)

As large language models matured, Wisecube didn’t abandon its graph-based foundation. Instead, the team started building natural language interfaces on top of it. The goal was to let users interact with structured biomedical knowledge through plain English, without losing the precision and traceability of the underlying system. As Vishnu puts it,

“…the whole LLM thing happened… we were trying to see how we can merge the power of the LLMs—use that almost like an interface into this knowledge graph.”

This wasn’t a handoff from graphs to models. The graph remained the source of truth. What changed was how users could access it. With new tooling, LLMs could translate questions into structured queries, fetch relevant answers, and ground responses in known entities. This kept the generation tethered to facts rather than patterns. As Vishnu discussed,

“We built tooling … how do we query the graph and use the LLM to power some of these things as well? The underlying foundational network model was still … the power behind all of this.”
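One way to picture that interface layer: the LLM drafts a graph query, but the graph itself supplies the answer. Everything below is stubbed; `ask_llm`, `run_query`, and the SPARQL shape are hypothetical, not Wisecube’s actual tooling:

```python
# Hedged sketch of an LLM as an interface layer over a knowledge graph:
# the model writes the query, the graph answers it.

def ask_llm(prompt: str) -> str:
    """Stub for a real LLM call that drafts a graph query."""
    return "SELECT ?drug WHERE { ?drug <treats> <tuberculosis> }"

def run_query(sparql: str) -> list[str]:
    """Stub for executing the query against the knowledge graph."""
    return ["isoniazid", "rifampicin"]  # illustrative results

question = "Which drugs treat tuberculosis?"
sparql = ask_llm(f"Write a SPARQL query for: {question}")
answers = run_query(sparql)  # the graph, not the LLM, supplies the facts
print(answers)
```

Because the generated text is only a query, every answer traces back to a known entity in the graph, which is what keeps generation tethered to facts rather than patterns.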

That architecture led to the Pythia project. The idea was to use LLMs not just to generate text, but to extract factual claims, verify them against source documents, and decide whether to insert them back into the graph. It was a practical step toward more trustworthy summarization and retrieval-based systems.
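The extract–verify–insert loop described for Pythia can be sketched as follows; the function names and the keyword-overlap verifier are hypothetical stand-ins, not the actual Pythia implementation:

```python
# Illustrative skeleton of the extract -> verify -> insert loop.
# The LLM calls are stubbed out; names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str

def extract_claims(text: str) -> list[Claim]:
    """Stub for an LLM that pulls (subject, predicate, object) triples."""
    # a real system would prompt a model; we return a fixed triple
    return [Claim("aspirin", "inhibits", "COX-2")]

def supported_by(claim: Claim, source: str) -> bool:
    """Stub verifier: does the source document mention all three parts?"""
    return all(part.lower() in source.lower()
               for part in (claim.subject, claim.predicate, claim.obj))

graph: set[tuple[str, str, str]] = set()
doc = "Studies show aspirin inhibits COX-2 at low doses."

for claim in extract_claims(doc):
    if supported_by(claim, doc):  # only verified claims reach the graph
        graph.add((claim.subject, claim.predicate, claim.obj))

print(graph)
```

Unverified claims simply never enter the graph, which is the mechanism behind the “trustworthy summarization” framing.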

This phase didn’t replace what came before. It added a layer of access, blending conversational fluency with graph-level accuracy. The same principle still applied: scale the foundation, then build flexible, focused layers on top.

Looking Forward with John Snow Labs

In the closing part of the episode, the conversation shifts to what lies ahead. Now that Wisecube is part of John Snow Labs, the focus turns to how their work will expand.

Vishnu sees a clear path forward. The knowledge graph infrastructure Wisecube has spent years building will now connect to Spark NLP, a clinical text processing platform already in use across the healthcare industry.

“…take all of these learnings and innovations we’ve done over the years and join it with some of John Snow’s really cool tech … like Spark NLP.” — Vishnu

For Alex, the transition feels familiar. Spark NLP was one of his first major open-source projects. Rejoining that ecosystem means they can move faster with tools they already understand.

The team sees the acquisition as a step forward. Their research focus will stay the same, but now it has a wider reach. John Snow Labs brings the infrastructure to move ideas into production, especially in clinical settings.

The podcast will continue. So will the work on language models, graphs, and new ways to combine them. As Alex assured,

“The podcast is continuing, just to make sure everyone knows … I think it’s going to be great.” 

From Search to Scale: What This Journey Reveals

Wisecube’s story follows the arc of modern NLP. It started with keyword search and TF-IDF, where relevance was limited to word overlap. Then came statistical models and classification pipelines built for clinical tasks.

Deep learning opened new doors, from contextual embeddings to in-silico chemistry. Graphs brought structure and reusability, forming a base that could support many tasks. Large language models now sit on top, making that structure accessible and verifiable.

At every stage, the team built systems that could adapt. The approach stayed consistent: create a strong foundation, then fine-tune for what comes next. That same mindset now guides their work at John Snow Labs.

If this episode gave you something to think about, we invite you to explore more. You can find all Gradient Descent episodes on YouTube, or subscribe to our LinkedIn newsletter for curated research links, transcript highlights, and fresh insights from behind the scenes.

Where is your team on this journey? Are you still tuning search? Exploring domain-specific graphs? Figuring out how to trust LLM outputs? Let us know what you're building and what’s holding you back.

Have questions or ideas? Email us at oksana@wisecube.ai.
