The GenAI App Step You’re Skimping On: Evaluations
If your organization’s generative AI app didn’t pan out, maybe it’s because your team lacked a strong evaluation process. Here’s a recipe for how to shape one.
If your organization is developing generative AI applications based on large language models (LLMs), you must have a rigorous process to evaluate the quality of each application.
An evaluation process consists of evals — automated tests designed to measure how well your LLM application performs on metrics that capture what end users care about and what is important to the business. Evals speed development by focusing effort on the areas that matter and increase the likelihood of building applications that deliver organizational value. However, the reality is that many teams underinvest in evals. The result: uneven progress and, ultimately, canceled GenAI projects or flawed applications that fail to achieve the business goal.
Business leaders, IT leaders, and developers working hand in hand to develop generative AI apps to solve business problems all benefit from a strong evaluation process. The business and IT leaders gain visibility into the app’s true quality level over the course of the development cycle, and developers can answer critical questions, like “Are we making enough progress?” or “What should the next dev cycle focus on?” or “Is the application ‘good enough’ to deploy?”
Here, I provide practical guidance that is based on my research, hands-on work building applications, and lessons from fellow practitioners for setting up an effective, iterative evaluation process. This type of process drives rapid progress and sets the organization up for GenAI app success.
Building LLM-Based Applications: Three Basics
Before diving into how to evaluate GenAI apps, let’s briefly review the basic approach for building an LLM-based application. First, the team chooses an LLM as the backbone (such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, or Meta’s Llama 3.2) and then “adapts” the LLM to the specific requirements of the application. Three adaptation techniques are commonly used: prompt engineering, retrieval-augmented generation (RAG), and instruction fine-tuning.
1. Prompt engineering. Prompting, or instructing the LLM to perform the task, may be sufficient when the task you’re tackling is one that could be accomplished by a typical human without specialized training. Let’s say your team is building an LLM-based application to read product reviews from an e-commerce site to find mentions of product issues or defects. You can share the text of each review in the prompt and ask, “Does the following review indicate a potential product defect? Answer yes or no.” Today’s LLMs can handle this kind of question with accuracy. That said, small changes in the wording of the prompt can cause big changes in the output, so the team may need to do some tweaking, or prompt engineering.
2. Retrieval-augmented generation. In some cases, you may need to provide the LLM with access to proprietary data. For example, an application to automate/augment the process of responding to a request for proposals needs access to company data and responses to previous RFPs. RAG offers a solution: The most relevant facts and content pertaining to the question can be extracted from company data, added to the prompt, and sent to the LLM, which can then provide a helpful response.
3. Instruction fine-tuning. If the task deals with jargon and specialized knowledge (like in medical notes or legal briefs), you may need to take a step beyond prompt engineering and RAG. Now it’s time for instruction fine-tuning — training the LLM with input-output examples drawn from the task domain. This approach can also help in cases where you’re struggling to articulate how, exactly, the LLM should perform the task or whether you require a specific style, tone, or format in the output.
What do these options look like in practice? Typically, development teams start with prompt engineering. If access to company-specific data is required, RAG will also be part of the mix. Developers will iterate through prompt engineering and tune various RAG settings. If progress proves to be unsatisfactory, they may invest in instruction fine-tuning (or perhaps even more advanced techniques, like reinforcement learning from human feedback).
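As a minimal sketch of the prompt engineering starting point, the product review example above could be wired up as follows. The names build_prompt, parse_yes_no, flags_defect, and fake_llm are hypothetical, and fake_llm is a stand-in for whatever model client the team actually uses; no real vendor API is shown.

```python
from typing import Callable, Optional

# Prompt wording taken from the product-review example in the text.
PROMPT_TEMPLATE = (
    "Does the following review indicate a potential product defect? "
    "Answer yes or no.\n\nReview: {review}"
)


def build_prompt(review: str) -> str:
    """Insert the review text into the fixed instruction."""
    return PROMPT_TEMPLATE.format(review=review.strip())


def parse_yes_no(raw: str) -> Optional[bool]:
    """Map free-form model text to True/False, or None if it is neither."""
    words = raw.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def flags_defect(review: str, call_llm: Callable[[str], str]) -> Optional[bool]:
    """Ask the model about one review and parse its answer."""
    return parse_yes_no(call_llm(build_prompt(review)))


def fake_llm(prompt: str) -> str:
    """Assumed stand-in for a real model call, used only for this demo."""
    return "Yes." if "broke" in prompt.lower() else "No."


print(flags_defect("The handle broke after two days.", fake_llm))
print(flags_defect("Great blender, very quiet.", fake_llm))
```

Returning None for an unparseable answer keeps malformed outputs visible instead of silently counting them as "no," which matters once the team starts measuring error rates.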
Why GenAI Apps Need a Rigorous Evaluation Process
Once you understand the building steps, it’s clear why the quality questions we hinted at earlier matter a great deal. As the team iterates, how does it know that it’s making progress? How does the team decide that prompt engineering isn’t enough and that instruction fine-tuning is needed? How does the team decide whether the application is good enough to ship? After deployment, how will the team keep the application in sync with changes in user behavior? As LLM vendors update their models, how will the team ensure that the application works as well with newer models?
The answer to all of these questions is to implement a rigorous yet easy-to-execute evaluation process. A strong process feeds a carefully chosen set of inputs to the LLM and evaluates the LLM outputs based on a set of metrics specific to the use case. Ideally, these evals are executed after every significant change to the application, so that the development team can immediately see which metrics improved and which metrics worsened. The team can then drill down to get insight into which components of the application are most in need of improvement and thus where to focus next. And by regularly running evals after application launch, teams can quickly detect performance degradations due to changes in external factors.
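The loop described above can be sketched in a few lines. This is an illustrative outline, not a prescribed design: EvalCase, run_evals, toy_app, and the two metrics are hypothetical names, and a real application would replace toy_app with a call into the LLM pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    question: str
    expected: str


# A metric looks at one case and the app's output and returns pass/fail.
Metric = Callable[[EvalCase, str], bool]


def run_evals(
    app: Callable[[str], str],
    cases: List[EvalCase],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Run every case through the app and return the pass rate per metric."""
    if not cases:
        raise ValueError("the test suite is empty")
    passed = {name: 0 for name in metrics}
    for case in cases:
        output = app(case.question)
        for name, metric in metrics.items():
            if metric(case, output):
                passed[name] += 1
    return {name: count / len(cases) for name, count in passed.items()}


def toy_app(question: str) -> str:
    """Assumed stand-in for the real LLM application."""
    answers = {
        "What is the return window?": "30 days",
        "Do you ship abroad?": "Yes, to 40 countries via courier partners.",
    }
    return answers.get(question, "I don't know.")


cases = [
    EvalCase("What is the return window?", "30 days"),
    EvalCase("Do you ship abroad?", "yes"),
]
metrics = {
    "contains_expected": lambda case, out: case.expected.lower() in out.lower(),
    "concise": lambda case, out: len(out.split()) <= 5,
}
scores = run_evals(toy_app, cases, metrics)
print(scores)  # {'contains_expected': 1.0, 'concise': 0.5}
```

Reporting one pass rate per metric, rather than a single overall score, is what lets the team see which aspect of quality moved after a change.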
A robust evaluation process has a few key ingredients. Let’s break it down.
Gather the questions. A good starting point is to assemble a suite of inputs/questions representative of what end users are likely to ask the application. These should also include corner cases (that is, inputs with incomplete, wrong, or irrelevant information). Ideally, the suite contains a few hundred well-chosen examples, which makes it more likely that the inputs are diverse and representative.
Since the application is still under development, the team won’t have actual inputs from end users, so synthetic data techniques can be used to generate test cases. The basic idea is to prompt an LLM to generate the kinds of inputs you want. Many approaches have been developed, and a number of tools are available, including LlamaIndex and Ragas.1 Synthesized inputs will need to be checked for quality and relevance, and a few iterations of prompting and quality checks may be required to assemble a good test set.
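To make the basic idea concrete, here is a rough sketch of prompting for synthetic inputs and cleaning up what comes back. It deliberately does not use LlamaIndex or Ragas; build_generation_prompt and parse_generated_inputs are hypothetical helpers, and the prompt wording is an assumption that would need iteration in practice.

```python
import re
from typing import List


def build_generation_prompt(app_description: str, n: int) -> str:
    """Ask an LLM for n realistic user inputs, including corner cases."""
    return (
        f"You are helping test this application: {app_description}\n"
        f"Write {n} questions that real users might ask, one per line. "
        "Include some with incomplete, wrong, or irrelevant information."
    )


# Matches list markers such as "1.", "2)", "-", "*" at the start of a line.
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")


def parse_generated_inputs(raw: str) -> List[str]:
    """Strip list markers, drop blank lines, and remove duplicates."""
    seen = set()
    inputs = []
    for line in raw.splitlines():
        text = _LIST_MARKER.sub("", line).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            inputs.append(text)
    return inputs


# Example of raw text a model might return (made up for this demo).
raw_output = (
    "1. Where is my order?\n"
    "2) where is my order?\n"
    "- Can I return a used item?\n"
    "\n"
)
print(parse_generated_inputs(raw_output))
# ['Where is my order?', 'Can I return a used item?']
```

Deduplication only handles the mechanical part of the cleanup; checking each synthesized input for quality and relevance still calls for a person reading the list.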
Conduct an error analysis. Next, the team should run each question in the suite through the application, manually examine the output for each question, assess whether the output adequately answers the question while satisfying other criteria (such as tone, style, and bias), identify all of the errors, and correct the response, if needed. Actual end users (or someone who can accurately advocate for end users) should play a central role in this step. This error analysis is performed in the spirit of data-centric AI practices that are common in machine learning and data science work but less common in software engineering.2 Studying the output data is time-consuming and laborious (and therefore often skipped), but this work is indispensable: It enables the team to develop a deep understanding of the errors that occur, to define tests to catch those errors, and to adapt the LLM to avoid making those errors.
Automate the error analysis. After the first round of error analysis is complete, your team will have a test set of inputs and ground-truth outputs (that is, outputs that people have reviewed and, where needed, corrected). The team will also have identified different kinds of errors and will have ideas for how to engineer the prompt (or otherwise adapt the LLM) to improve the application. But after implementing these ideas, how will the team assess whether these changes improved the application? Repeating the painstaking manual error analysis described above after every change to the application isn’t practical. What you need is a way to automate the error analysis.
It is possible to write code to automate error checking in some cases. For example, an LLM prompt may include a list of general do’s and don’ts, such as “don’t create and share URLs” or “don’t output email addresses.” It’s straightforward to write code to check whether the LLM response contains any URLs or email addresses. This approach also works when outputs have a simple or clearly defined structure (such as multiple-choice questions).
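A simple version of such checks might look like the sketch below. The regular expressions are deliberately rough approximations rather than full URL or email grammars, and find_violations and is_valid_choice are hypothetical names.

```python
import re
from typing import List, Sequence

# Approximate patterns: good enough to flag likely violations for review.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_violations(response: str) -> List[str]:
    """Return the names of the rules that the response breaks."""
    violations = []
    if URL_RE.search(response):
        violations.append("contains_url")
    if EMAIL_RE.search(response):
        violations.append("contains_email")
    return violations


def is_valid_choice(response: str, choices: Sequence[str] = ("A", "B", "C", "D")) -> bool:
    """Check that a multiple-choice answer is exactly one allowed option."""
    return response.strip().upper().rstrip(".") in choices


print(find_violations("See https://example.com or mail help@example.com"))
# ['contains_url', 'contains_email']
print(find_violations("Please contact support through the app."))
# []
print(is_valid_choice(" b. "))
# True
```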
But this approach has its limits. For example, how can you check whether an LLM summarization of an article is indeed a faithful summary? How can you check whether an LLM response avoids the use of slang? For these types of open-ended questions, an approach known as LLM-as-a-judge comes into play.3 The team feeds the input, the LLM output, and a ground-truth answer to another LLM and then prompts it to evaluate whether the first LLM’s output matches the ground-truth answer and satisfies other stated criteria. Since a “judge LLM” can hallucinate and make mistakes (just like any other LLM), the team will want to compare the judge LLM’s evaluations to human evaluations across the suite of questions and iterate on the judge prompt until the two are mostly in agreement.4
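The two halves of this step, prompting the judge and checking the judge against people, can be sketched as follows. The prompt wording, the PASS/FAIL convention, and the function names are assumptions made for illustration; the actual call to the judge model is left out.

```python
from typing import List, Optional


def build_judge_prompt(question: str, output: str, reference: str, criteria: str) -> str:
    """Assemble the input, the app's output, and the ground truth for the judge."""
    return (
        "You are grading an answer produced by another system.\n"
        f"Question: {question}\n"
        f"Answer to grade: {output}\n"
        f"Reference answer: {reference}\n"
        f"Additional criteria: {criteria}\n"
        "Reply with PASS or FAIL on the first line, then a one-sentence reason."
    )


def parse_verdict(raw: str) -> Optional[bool]:
    """Read PASS/FAIL from the start of the judge's reply; None if missing."""
    text = raw.strip().upper()
    if text.startswith("PASS"):
        return True
    if text.startswith("FAIL"):
        return False
    return None


def agreement_rate(judge: List[bool], human: List[bool]) -> float:
    """Fraction of cases where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def disagreements(judge: List[bool], human: List[bool]) -> List[int]:
    """Indexes of the cases worth reading when revising the judge prompt."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]


judge_verdicts = [True, False, True, True]
human_verdicts = [True, True, True, False]
print(agreement_rate(judge_verdicts, human_verdicts))  # 0.5
print(disagreements(judge_verdicts, human_verdicts))   # [1, 3]
```

Listing the disagreeing cases, not just the rate, gives the team concrete examples to read when deciding how to revise the judge prompt.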
For both direct error checking and LLM-as-a-judge evals, LLM vendors have created code templates that implement a variety of useful checks. By making use of these templates, your team can implement evals quickly.
A Supporting Structure for Rapid Progress
At this point, the team has built a foundation that can support rapid progress: After every change to the application, the team can “push the button,” run the error-checking code, and quickly glean which errors improved, which errors regressed (got worse because of unintended side effects of other changes), and which errors should be tackled next.
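One way to picture the "push the button" report is a comparison of pass rates before and after a change. The sketch below assumes each run is summarized as a mapping from metric name to pass rate; compare_runs, the metric names, and the tolerance value are all hypothetical.

```python
from typing import Dict, List


def compare_runs(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.01,
) -> Dict[str, List[str]]:
    """Sort metrics into improved, regressed, and unchanged between two runs."""
    report: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Only metrics present in both runs can be compared.
    for name in sorted(before.keys() & after.keys()):
        delta = after[name] - before[name]
        if delta > tolerance:
            report["improved"].append(name)
        elif delta < -tolerance:
            report["regressed"].append(name)
        else:
            report["unchanged"].append(name)
    return report


before = {"faithful": 0.80, "no_urls": 1.00, "tone": 0.70}
after = {"faithful": 0.90, "no_urls": 0.95, "tone": 0.70}
print(compare_runs(before, after))
# {'improved': ['faithful'], 'regressed': ['no_urls'], 'unchanged': ['tone']}
```

The tolerance keeps tiny fluctuations from being reported as progress or regressions; the right value depends on the size of the test suite.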
When the application ultimately passes the evals and the team is getting ready to launch, there’s an important final step: The team must instrument the application to log the details of every user interaction. These logs are invaluable to ensure that the application stays relevant after launch.
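As a rough illustration of that instrumentation, each interaction can be appended as one JSON line. The field names and the log_interaction helper are assumptions; a production system would also need to consider privacy, retention, and where the log is stored.

```python
import io
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, TextIO


def log_interaction(
    sink: TextIO,
    user_input: str,
    prompt: str,
    output: str,
    model: str,
    app_version: str,
) -> Dict[str, str]:
    """Append one interaction to the sink as a single JSON line."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_input": user_input,
        "prompt": prompt,          # the full text sent to the model
        "output": output,
        "model": model,            # lets the team compare model versions later
        "app_version": app_version,
    }
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Demo with an in-memory buffer; a real app would write to a file or log service.
buffer = io.StringIO()
record = log_interaction(
    buffer,
    user_input="Where is my order?",
    prompt="Answer the customer politely. Question: Where is my order?",
    output="It ships tomorrow.",
    model="example-model",
    app_version="0.1.0",
)
logged = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(logged[0]["user_input"])  # Where is my order?
```

Recording the model name and application version alongside each exchange makes it possible to trace a change in quality back to a vendor model update or a specific release.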
After the app goes live with end users, the team must regularly examine the interaction logs. Inputs posed by actual users will likely be much more varied than the test suite that your team synthesized. As a result, the error-checking code may not catch all of the ways in which the LLM is responding incorrectly or unexpectedly to these new inputs. Therefore, the team needs to repeat the process described earlier: Add new inputs to the test suite, iterate to ensure that the LLM correctly handles these inputs, and update the error-checking code to cover these new cases. A piece of good news is that this process can be made easier using special-purpose tools that support the creation and execution of evals, such as Promptfoo, Humanloop, and DeepEval.
This process also makes it easier to assemble data sets for instruction fine-tuning, for which your team needs to generate inputs and high-quality answers. An eval process helps with both aspects: The team can efficiently generate inputs using the synthetic data capability (as well as actual user inputs) and improve output quality by automatically filtering out incorrect outputs using the error-checking code.
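The filtering step might be sketched like this, assuming the error checks are available as functions that take an input and an output and return pass or fail. The function name, the two sample checks, and the output record format are hypothetical; fine-tuning services each expect their own file format.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A check receives (input, output) and returns True when the output passes.
Check = Callable[[str, str], bool]


def build_finetuning_set(
    pairs: Iterable[Tuple[str, str]],
    checks: List[Check],
) -> List[Dict[str, str]]:
    """Keep only the input-output pairs that pass every check."""
    kept = []
    for user_input, output in pairs:
        if all(check(user_input, output) for check in checks):
            kept.append({"input": user_input, "output": output})
    return kept


def no_url(user_input: str, output: str) -> bool:
    return "http" not in output.lower()


def non_empty(user_input: str, output: str) -> bool:
    return bool(output.strip())


candidates = [
    ("How do I reset my password?", "Open Settings and choose Reset Password."),
    ("Where is the manual?", "See http://example.com/manual"),
    ("What are your hours?", "   "),
]
dataset = build_finetuning_set(candidates, [no_url, non_empty])
print(len(dataset))  # 1
```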
A rigorous evaluation process is a key driver of success for LLM application development projects and leads to better applications and more satisfied users — even as models and user needs change. Therefore, business and IT leaders must provide their development teams with the time and budget to invest in evals and then hold them accountable for making evals an indispensable part of the LLM application development process.
References
1. R. Liu, J. Wei, F. Liu, et al. “Best Practices and Lessons Learned on Synthetic Data,” arXiv, revised Aug. 10, 2024, https://arxiv.org.
2. R. Ramakrishnan, “How to Build Good AI Solutions When Data Is Scarce,” MIT Sloan Management Review 64, no. 2 (winter 2023): 48-53.
3. L. Zheng, W.-L. Chiang, Y. Sheng, et al. “Judging LLM-as-a-Judge With MT-Bench and Chatbot Arena,” arXiv, revised Dec. 24, 2023, https://arxiv.org.
4. H. Husain, “Creating a LLM-as-a-Judge That Drives Business Results,” Hamel Husain (blog), Oct. 29, 2024, https://hamel.dev.