Haziqa Sajid
May 1, 2025
TL;DR: Prompt engineering is evolving from intuition-driven guesswork into a disciplined practice grounded in structure, evaluation, and modular design. In this episode, Vishnu Vettrivel and Alex Thomas examine why prompting remains a fragile part of AI development and how new tools like DSPy are helping teams build more reliable, scalable, and maintainable systems around it.
In the fourth episode of Gradient Descent, hosts Vishnu Vettrivel and Alex Thomas tackle a recurring challenge in the world of AI: Why does something as essential as prompting still feel so unpredictable?
Language models today are astonishingly capable, yet even small shifts in wording can throw off their output. Developers often find themselves stuck in loops of trial and error, chasing consistency with tools that feel more like folklore than engineering. What works one moment might fail the next, and nobody can quite explain why.
At the heart of this issue is a deeper discomfort. Prompts, despite driving the behavior of complex systems, are still treated as one-off instructions. They are rarely versioned, evaluated, or reused with intent. In many teams, the prompt exists in a text box or a notebook cell, disconnected from the infrastructure that supports it.
This informality would be unthinkable in any other part of a production pipeline. Yet in the world of LLMs, it remains the norm.
The result is an uneasy paradox. We are building advanced AI systems on top of some of the least structured components in the stack. And as Vishnu and Alex discuss, the costs of that looseness are starting to show.
The Problem with Prompt Engineering Today
Prompt engineering was never supposed to carry this much weight.
At first, prompts were a workaround, an informal way to shape model behavior without retraining. But as language models made their way into production systems, prompts became load-bearing structures.
Prompt engineering now influences everything from chatbot tone to whether a system returns the right legal clause or writes syntactically correct code. And yet, the tooling and practices surrounding prompts haven’t kept up. As Vishnu notes:
"It brings back the rigor that you need while building probabilistic, non-deterministic applications and pipelines. Because by definition, they're very non-deterministic. They're just going to do what they're going to do, and depending on what examples you feed, the results might vary drastically."
Developers find themselves chasing output quality by tweaking punctuation, reordering clauses, or using oddly specific phrasing. There is rarely a clear answer to what changed or why something stopped working. This kind of manual tuning doesn’t scale. It doesn’t transfer across domains. And perhaps most importantly, it doesn’t encourage repeatability. As Alex notes:
“It’s like tuning the weights of a neural network manually. You don’t really have a loss function, you don’t really have gradients—you’re just nudging things around and hoping something better comes out. It’s all very ad hoc.”
This gap between what prompts control and how they’re managed is growing more obvious as LLM use expands. More companies are deploying AI into production workflows. More systems are expected to behave consistently across users, data inputs, and time. And more teams are realizing that their entire stack can become unstable without a systematic way to design and evaluate prompts.
Enter DSPy: What Is It?
The frustration with prompts isn’t that they fail. It’s that they fail unpredictably. One version works. Another doesn’t. There’s no trail to follow, no abstraction to inspect. Just another tweak to try. For a long time, developers had no choice but to live with this uncertainty.
DSPy changes this equation.
Rather than treating prompts as strings passed between boxes, DSPy treats them as modular components. These components are units that can be defined, composed, and reasoned about.
A DSPy module has a defined signature, structured inputs, and expected outputs. It behaves like a function. More importantly, it can be tested like one. As Alex points out:
“It just becomes more like programming with functions and modules… instead of just this arcane framework that has its own concepts that take you away from regular programming style.”
With DSPy, developers no longer have to choose between flexibility and structure; they gain:
Modularity: Prompts are packaged into standalone components that can be reused across workflows.
Clarity: Each module has a clear signature with defined inputs and outputs, making behavior more predictable.
Evaluability: Prompts can be measured, scored, and optimized as part of a pipeline, not just spot-checked.
Accessibility: The syntax mirrors familiar Python code, lowering the barrier for non-specialists.
And for those coming in from outside the ML world, that last point matters. DSPy doesn't require learning a new paradigm. There’s no exotic syntax or black-box abstraction. Just code that looks and feels like code.
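To see what that looks like in practice, here is a minimal sketch (not taken from the episode) of a single DSPy call. It assumes the dspy package is installed and a language model backend is available; the model name is only a placeholder, and configuration details vary slightly across DSPy versions.

```python
import dspy

# Configure a backend; the model identifier here is a placeholder.
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# The "prompt" is expressed as a signature: question in, answer out.
qa = dspy.Predict("question -> answer")

result = qa(question="What is the capital of France?")
print(result.answer)
```

There is no hand-written prompt string anywhere in this snippet; DSPy generates one from the signature, which is exactly the shift the hosts are describing.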
Key Components of DSPy
DSPy has transformed how we build with language models. Behind its clean syntax and accessible interface is a set of powerful abstractions that give developers control, visibility, and confidence.
The episode walks through these ideas in detail, revealing how each piece of the framework supports a more scalable and maintainable approach to prompt engineering.
Signatures: Defining the Contract
Every DSPy module starts with a signature. This is not just a naming convention. It is a declarative contract that defines what the module is supposed to do.
A signature spells out:
The inputs the model expects
The outputs it should generate
A docstring that captures the intent in natural language
This small structure makes a big difference. It turns the prompt into a function-like object that can be composed, inspected, and reused across different workflows. Developers can now reason about prompts the same way they reason about code: with clarity and intent.
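As an illustration, a class-based signature might look like the following sketch; the task, field names, and descriptions are invented for this post rather than taken from the episode.

```python
import dspy

class SummarizeTicket(dspy.Signature):
    """Summarize a customer support ticket in one or two sentences."""

    ticket_text: str = dspy.InputField(desc="Raw text of the support ticket")
    summary: str = dspy.OutputField(desc="Concise summary of the issue")

# The signature only defines the contract; a module such as dspy.Predict
# decides how to turn it into an actual prompt at runtime.
summarize = dspy.Predict(SummarizeTicket)
```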
Modules: Packaging Prompt Logic
Wrapped around the signature is the module, which contains the logic for how the prompt operates. These modules can stand alone or be chained into larger pipelines, allowing developers to build complex behaviors from simpler components.
What makes this powerful is composability. You don’t need a monolithic block of prompt text to guide a system. Instead, you can build behaviors piece by piece, testing and improving each part independently.
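Here is a hypothetical sketch of that composability: a small custom module that chains two sub-modules, each with its own signature. The task and signatures are made up for illustration.

```python
import dspy

class OutlineThenDraft(dspy.Module):
    """Compose two prompt steps: outline a topic, then draft from the outline."""

    def __init__(self):
        super().__init__()
        self.outline = dspy.ChainOfThought("topic -> outline")
        self.draft = dspy.Predict("topic, outline -> article")

    def forward(self, topic):
        # Each sub-module can be tested and improved independently.
        outline = self.outline(topic=topic).outline
        return self.draft(topic=topic, outline=outline)

writer = OutlineThenDraft()
draft = writer(topic="Why prompts deserve version control")
```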
Evaluators: Measuring What Matters
Prompt engineering has long suffered from a lack of meaningful evaluation. Spot-checking outputs or judging results by feel isn’t enough, especially in production.
DSPy brings evaluation to the forefront. Developers can attach evaluators to modules that:
Measure correctness or relevance
Compare the results before and after changes
Quantify progress across datasets
These evaluators can even be powered by LLMs, forming the basis for LLM-as-a-Judge workflows covered in earlier episodes.
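A minimal sketch of attaching an evaluator, using DSPy's Evaluate utility with a hand-written metric; the dev set and metric below are illustrative, and an LLM-based judge can be plugged into the same metric slot.

```python
import dspy
from dspy.evaluate import Evaluate

# A tiny illustrative dev set; real pipelines would use many more examples.
devset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Japan?", answer="Tokyo").with_inputs("question"),
]

# A simple containment metric; an LLM-as-a-Judge metric has the same shape.
def answer_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

qa = dspy.ChainOfThought("question -> answer")
evaluate = Evaluate(devset=devset, metric=answer_match, display_progress=True)
score = evaluate(qa)  # aggregate score for the module over the dev set
```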
Optimizers: Refining Without Rewriting
Once modules and evaluators are in place, DSPy makes it easy to optimize prompts. But, as Alex highlights, this isn’t mandatory:
“You don’t have to use the optimizers. In fact, the optimizers are completely optional. You can just set up your signature modules, set up your examples, and then you have your evaluators and that’s it. Now you have a pipeline that you can measure how good it’s doing. You can baseline it.”
Optimization is treated as a layer you can add once you’re ready. DSPy supports different optimization strategies, from simple tuning to more advanced search techniques, all while keeping the original structure intact. More on this in the next section.
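As a hedged sketch of that optional layer, the snippet below compiles a module with BootstrapFewShot, one of DSPy's simpler optimizers; the training examples and metric are placeholders.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

trainset = [
    dspy.Example(question="What is 7 * 6?", answer="42").with_inputs("question"),
    dspy.Example(question="What is 9 + 5?", answer="14").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer == prediction.answer.strip()

qa = dspy.ChainOfThought("question -> answer")

# The optimizer searches for effective few-shot demonstrations while leaving
# the module's structure intact, which is the point the episode emphasizes.
optimizer = BootstrapFewShot(metric=exact_match)
compiled_qa = optimizer.compile(qa, trainset=trainset)
```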
Why This Matters: Optimization Meets Abstraction
Prompt engineering has always relied on intuition. What it lacked was structure. Manual tuning worked for a while, but as systems scaled and complexity grew, those ad hoc methods began to show their limits.
DSPy introduces the missing foundation by aligning abstraction with optimization. Instead of forcing developers to choose between flexibility and control, it offers both. Each component is designed to be clean, composable, and capable of continuous improvement, which makes prompting something that can evolve without breaking: you can start with signatures, modules, examples, and evaluators alone, establish a baseline, and layer in optimization when it pays off.
The most significant shift, however, is not in structure alone, but in behavior. With DSPy, prompting becomes a learnable function rather than a fixed artifact. Prompts are no longer isolated inputs that developers tweak endlessly. Instead, they are embedded in a dynamic system that adapts, refines itself, and in some cases, teaches.
And the performance gains reflect that shift:
When paired with a well-structured DSPy pipeline, GPT-3.5 achieves 88% accuracy on GSM8K, a benchmark focused on math reasoning, without any additional fine-tuning.
Even more notably, LLaMA 2–13B can match GPT-3.5’s performance under the same conditions. This highlights how effective orchestration can be, rivaling the performance of larger models.
For cost-sensitive or edge scenarios, smaller models like T5-Large can be fine-tuned using DSPy programs as teachers. This approach enables high performance without the need for significant infrastructure.
Together, these outcomes signal a new phase in LLM development where the quality of the surrounding system matters as much as the model itself.
Real-World Impact: Where Modular Pipelines Win
The strength of DSPy isn’t just in its abstractions. Its real impact shows when those abstractions are applied to practical problems. In real-world settings, DSPy’s modular pipelines consistently deliver results that rival, and sometimes exceed, what much larger, proprietary models can achieve.
Take, for example, the challenge of math word problems, which require multi-step reasoning. DSPy pipelines combine two prompting strategies: Chain of Thought, which guides the model through intermediate steps, and Reflection, which encourages it to evaluate and refine its own reasoning. As Vishnu notes:
“You mentioned we switched from just a basic prediction to using chain of thought, and how easy it was to do. You just changed one line of code and immediately saw an improvement in quality.”
Together, this pairing proved more effective than even carefully crafted, human-written prompts. And the impact goes beyond accuracy.
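The “one line of code” change Vishnu describes would look roughly like this; the signature and question are illustrative, and the Reflection step the hosts mention would be layered on as an additional module.

```python
import dspy

# Basic prediction: the model answers directly.
solve = dspy.Predict("question -> answer")

# Switching to chain-of-thought reasoning is a one-line change: the module
# now elicits intermediate reasoning before producing the final answer.
solve = dspy.ChainOfThought("question -> answer")

result = solve(question="A train covers 60 km in 45 minutes. What is its speed in km/h?")
print(result.answer)
```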
Pipelines built this way are modular by design: easier to reuse, extend, and adapt across new tasks. This direction is reinforced by findings from the ICLR 2024 paper “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.”
The research introduces DSPy as a high-level programming model for LLMs, positioning prompt logic not as a fixed string but as part of a pipeline that can be executed, inspected, and improved over time.
In more complex domains like multi-hop question answering, DSPy continues to stand out. On the HotPotQA benchmark, the team built a pipeline that combined ReAct-style reasoning with a customized multi-hop retrieval strategy. This setup enabled the model to gather evidence progressively, link facts across documents, and generate grounded, high-quality responses.
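A simplified, hypothetical sketch of a pipeline in that spirit is shown below. It is not the exact program from the episode or the paper: the hop count, retriever setup, and signatures are assumptions, and it presumes a retrieval model has been configured alongside the LM.

```python
import dspy

class MultiHopQA(dspy.Module):
    """Gather evidence over several hops, then answer from the collected context."""

    def __init__(self, num_hops=2, passages_per_hop=3):
        super().__init__()
        self.num_hops = num_hops
        self.retrieve = dspy.Retrieve(k=passages_per_hop)  # assumes dspy.configure(rm=...)
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            # Each hop writes a new search query informed by the evidence so far.
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages
        return self.answer(context=context, question=question)
```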
What makes these results even more compelling is that they were achieved using smaller, open-weight models. In several cases, DSPy pipelines not only held their own against larger models such as PaLM and Codex, but outperformed them through orchestration and design.
Together, these case studies reinforce the episode’s core insight: effective prompting is no longer about writing better text. It is about building systems that reason, revise, and adapt, using structure rather than size to drive performance.
Implications for AI Builders
If you’re building with large language models, it’s time to treat prompting as a programmatic discipline. As tools like DSPy mature, they offer a practical pathway to transform fragile prompt strings into structured, maintainable systems. Here’s how to think about that shift in your own development work.
1. Build with Reusability in Mind
Start treating prompts as modular components. In DSPy, each module comes with a defined signature, making it easier to reuse across applications, tasks, and teams. This approach turns one-off experiments into reusable infrastructure.
2. Embrace Evaluation as Part of the Design
Don’t wait until the end to test your outputs. Attach evaluation logic to your prompts from the start. Whether you use simple checks or LLM-based scoring, you’ll gain visibility into what’s working, what’s improving, and what’s drifting.
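For the LLM-based scoring case, one option is to make the metric itself a DSPy module. Here is a rough sketch along those lines; the judge signature is invented for illustration.

```python
import dspy

class JudgeAnswer(dspy.Signature):
    """Decide whether a predicted answer is consistent with the reference answer."""

    question: str = dspy.InputField()
    reference: str = dspy.InputField()
    prediction: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

judge = dspy.Predict(JudgeAnswer)

# Drop-in metric for Evaluate or an optimizer: the judge is just another module.
def llm_judge_metric(example, pred, trace=None):
    verdict = judge(question=example.question,
                    reference=example.answer,
                    prediction=pred.answer)
    return bool(verdict.is_correct)
```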
3. Design for Adaptability Across Models
When your prompting logic is abstracted from the model, you can swap in different backends with minimal effort. This allows you to test across GPT, Claude, or open-weight models like LLaMA without rewriting your core logic.
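In DSPy, for example, that swap is typically a configuration change rather than a prompt rewrite. The model identifiers below are placeholders, and the exact configuration API differs a bit between DSPy versions.

```python
import dspy

qa = dspy.ChainOfThought("question -> answer")

# The same module runs against different backends by reconfiguring the LM.
for model_name in ["openai/gpt-4o-mini",
                   "anthropic/claude-3-5-sonnet-20241022",
                   "ollama_chat/llama3"]:
    dspy.configure(lm=dspy.LM(model_name))
    print(model_name, "->", qa(question="What is 12 * 12?").answer)
```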
4. Treat Prompting Like a Domain-Specific Language
Stop thinking of prompting as crafting better sentences. Begin thinking in layers of logic. Your modules can function like a lightweight DSL, defining how your system thinks, plans, and responds across use cases. Structure replaces trial-and-error.
5. Build for Maintenance, Not Just Performance
Your system needs to perform well today, but it also needs to make sense tomorrow. Clear structure, versioned modules, and visible evaluation pipelines make it easier to trace errors, improve logic, and onboard collaborators.
Final Thoughts: Prompting as a New Programming Paradigm
DSPy represents a turning point in how AI builders approach prompting. No longer a guessing game or a fragile set of text strings, prompting becomes a programming paradigm that can be compiled, optimized, evaluated, and scaled like any other part of a modern system.
Throughout this episode, Vishnu and Alex highlight how this shift is already reshaping real-world applications. Modular pipelines, structured evaluation, and model-agnostic design are no longer aspirational goals. They are becoming best practices for building AI systems that are reliable, maintainable, and adaptable across domains.
If this episode sparked your interest, there is much more to explore. You can listen to all episodes of Gradient Descent on YouTube or subscribe to our official podcast newsletter on LinkedIn. We regularly share episode updates, behind-the-scenes insights, curated research notes, and recommended readings directly to your inbox.
If you have questions, suggestions, or feedback, feel free to email us at oksana@wisecube.ai. We would love to hear your thoughts and continue the conversation!