Why experiments are necessary to evaluate LLMs - and how you can easily A/B test between various models with Eppo.
GPT-4, Llama 2, Claude, Orca… whether you’re looking to answer customer support tickets, generate content, sort through resumes, or query a large volume of data, the scope of business applications is growing steadily as LLMs advance in capabilities. Of course, choosing between the many LLM options, both commercial and open-source, is a problem requiring plenty of forethought — and experimentation. There are even differences between versions of a single model depending on use case — GPT-4 has a world of new capabilities vs. its predecessor GPT-3.5, but its slower speed and higher cost may make it a less optimal fit for certain use cases.
The need to A/B test different models, then, is obvious. If we care about the success and impact of a chosen model on business outcomes (as measured by defined metrics), an A/B test is our gold-standard tool for isolating the causal impact of different LLMs from other external factors. Offline evaluation all too often fails to correlate with our actual target business metrics (and can be prone to overfitting), so running online experiments in production becomes a necessity. (You can read more on why machine learning models cannot be evaluated without A/B experiments here.)
In this demonstration, I’ll provide a simple end-to-end example of how to run an experiment comparing LLMs using Eppo. Let’s imagine we want to compare the success of GPT-3.5 and GPT-4 for our hypothetical use case.
We will be using:
- the OpenAI API to generate answers (GPT-3.5 and GPT-4)
- Eppo's feature flagging SDK to randomize users into variants
- a simple Python web service exposing a question-answering endpoint
If you want to see the end result of the experiment being configured, I’ve also added it to Eppo’s preview app so you can click around and explore for yourself. (First get access to the preview app on the Eppo homepage, and then you can navigate directly to the tutorial example here)
Let's make things a bit more interesting by integrating with the OpenAI API to help us answer questions. Make sure you have signed up for the OpenAI API and have an OpenAI API key.
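As a sketch of what that integration might look like (the helper names, prompt, and default model string here are illustrative, not from the original app), here is a minimal call against the chat completions REST endpoint using only the standard library:

```python
import json
import os
import urllib.request

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def build_payload(question: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the chat-completions request body for a single question."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer the user's question concisely."},
            {"role": "user", "content": question},
        ],
    }

def answer_question(question: str, model: str = "gpt-3.5-turbo") -> str:
    """POST the question to the OpenAI API and return the model's answer.
    Expects an OPENAI_API_KEY environment variable."""
    request = urllib.request.Request(
        OPENAI_URL,
        data=json.dumps(build_payload(question, model)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

The official `openai` Python package wraps the same endpoint; passing `model="gpt-4"` instead of the default is all it takes to switch sides of the comparison.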
Test the endpoint again and verify that the answer to the ultimate question of life got slightly longer.
We now have a new way to answer questions, but how can we verify that our users prefer this option? Of course, we can run an experiment: we randomize users into different variations and then analyze the outcomes on key metrics we care about. Let's integrate the Eppo feature flagging SDK to make this trivial.
Then, create an allocation, which determines how the flag gets triggered. For an experiment, we might want to randomize users into each of the 3 variants with equal probability.
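For intuition, feature-flagging SDKs like Eppo's typically make this randomization deterministic by hashing the subject key together with the flag key, so the same user always lands in the same variant. A toy illustration (not Eppo's actual hashing scheme, and the flag key and variant labels are assumptions):

```python
import hashlib

# The 3 variants from our equal-probability allocation
VARIANTS = ["control", "gpt-3.5", "gpt-4"]

def assign_variant(user_id: str, flag_key: str = "llm-qa-experiment") -> str:
    """Deterministically bucket a user into one of the variants."""
    digest = hashlib.md5(f"{flag_key}-{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]
```

Because the assignment is a pure function of the keys, a returning user keeps their variant across requests, while the buckets stay close to one-third each across the whole population.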
Finally, turn on the flag in the testing environment so that we can evaluate it in our app; in the Eppo UI, your flag setup should now show all three variants in the allocation.
Next, add the Eppo SDK client to our imports.
In practice, you want to set up the logger to write the assignments to an assignment table in your data warehouse so that we can analyze the outcomes of the experiment by joining these assignments to the metrics we care about. But for now, printing the assignment is a useful placeholder to make sure everything works locally.
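A minimal placeholder along those lines: Eppo's SDKs expect a logger object exposing a `log_assignment` callback. The class name and the commented-out wiring below are an assumed sketch; check the SDK docs for your language's exact interface:

```python
class PrintAssignmentLogger:
    """Placeholder logger: print each assignment event instead of
    writing it to a warehouse assignment table."""

    def log_assignment(self, assignment_event) -> None:
        print("assignment:", assignment_event)

# Assumed wiring for the Python SDK (import paths vary by SDK version):
# import eppo_client
# from eppo_client.config import Config
#
# eppo_client.init(Config(api_key="<your-sdk-key>",
#                         assignment_logger=PrintAssignmentLogger()))
```

In production, the body of `log_assignment` becomes a write to your warehouse assignment table; nothing else about the integration changes.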
Next, let's tweak our QA endpoint to fetch the variant. We make three simple changes: accept a user identifier on the request, fetch that user's assigned variant from the Eppo client, and return the assigned variant in the response so we can verify the integration.
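A framework-agnostic sketch of those changes (the handler name, flag key, and variant labels are assumptions, and `get_variant` stands in for the Eppo SDK's assignment call):

```python
MODEL_BY_VARIANT = {
    "control": None,            # keep the original, non-LLM answer path
    "gpt-3.5": "gpt-3.5-turbo",
    "gpt-4": "gpt-4",
}

def handle_question(question: str, user_id: str, get_variant) -> dict:
    """Change 1: accept a user id.  Change 2: fetch the user's variant.
    Change 3: surface the variant in the response for verification."""
    variant = get_variant("llm-qa-experiment", user_id)
    return {
        "question": question,
        "variant": variant,
        # Completion call not wired up yet -- we only report which
        # model this user's variant maps to.
        "model": MODEL_BY_VARIANT.get(variant),
    }
```

Wiring this into a real web framework just means passing the request's user id through and calling the Eppo client where `get_variant` appears.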
Note that we have not updated the completion code yet; first let's focus on making sure we integrated the SDK successfully. Test the endpoint either through the interactive docs or by browsing directly to these URLs:
Using either the URLs above or the interactive documentation, we can now verify that the variant assigned to different users indeed leads to the API returning different responses.
At this point, you’ve configured the necessary assignment and flagging to run an online A/B test between LLMs. We also talked about why this is a critical competency to have: “AB Experiment Infra is AI Infra”! Depending on your level of context before reading the demonstration, you may still be missing some critical components of running an experiment, like how to choose which metrics to measure or how to analyze your results. Here are some recommended resources from the Eppo blog and docs to help you take your next step:
Building the Modern Experimentation Stack
The Warehouse-Native Experimentation Workflow
How to Set Up an Experiment in Eppo