GPT-4, Llama 2, Claude, Orca… whether you’re looking to answer customer support tickets, generate content, sort through resumes, or query a large volume of data, the scope of business applications is growing steadily as LLMs advance in capabilities. Of course, choosing between the many LLM options, both commercial and open-source, is a problem requiring plenty of forethought — and experimentation. Even different versions of the same model can be better or worse fits depending on the use case: GPT-4 offers a world of new capabilities vs. its predecessor GPT-3.5, but its slower speed and higher cost may make it a less optimal fit for certain applications.
The need to A/B test different models, then, is obvious. If we care about the success and impact of a chosen model on business outcomes (as measured by defined metrics), an A/B test is our gold-standard tool for isolating the causal impact of different LLMs from other external factors. Offline evaluation all too often fails to correlate with our actual target business metrics (and can be prone to overfitting), so running online experiments in production becomes a necessity. (You can read more on why machine learning models cannot be evaluated without A/B experiments here.)
In this demonstration, I’ll provide a simple end-to-end example of how to run an experiment comparing LLMs using Eppo. Let’s imagine we want to compare the success of GPT-3.5 and GPT-4 for our hypothetical use case.
We will be using FastAPI to build a simple question-answering API, the OpenAI API to generate answers, and Eppo's feature flagging SDK to randomize users between variants.
If you want to see the end result of the experiment being configured, I’ve also added it to Eppo’s preview app so you can click around and explore for yourself. (First get access to the preview app on the Eppo homepage, and then you can navigate directly to the tutorial example here)
Follow along in the GitHub repository where this demonstration lives.
Start by creating a virtual environment and activating it:
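For example, using Python's built-in venv module (the environment name .venv is just a convention):

```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```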
Next, install the libraries we will be using:
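Something along these lines should work, assuming FastAPI with uvicorn as the server, the OpenAI client, and Eppo's Python SDK (published on PyPI as eppo-server-sdk):

```bash
pip install fastapi uvicorn openai eppo-server-sdk
```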
Now, start the app using:
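Assuming the application code lives in main.py and the FastAPI instance is called app:

```bash
uvicorn main:app --reload
```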
Next up, let's create an endpoint that can answer questions. First, set up the API by adding the following endpoint:
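A minimal sketch (the route, parameter name, and hard-coded answer are assumptions for illustration):

```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/qa")
def answer_question(question: str):
    # Hard-coded answer for now; we will swap in an LLM next.
    return {"question": question, "answer": "42"}
```

You can try it out through FastAPI's interactive docs, which uvicorn serves at http://127.0.0.1:8000/docs by default.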
Let's make things a bit more interesting by integrating with the OpenAI API to help us answer questions. Make sure you have signed up for the OpenAI API and have an OpenAI API key.
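One common way to make the key available to the app, which the OpenAI client picks up automatically, is an environment variable:

```bash
export OPENAI_API_KEY="your-api-key"
```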
For the fun part, we create a function that uses OpenAI's chat completion to answer our question by adding the following code:
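A sketch using the openai Python client; the helper name, default model, and prompt format are assumptions, and the endpoint is updated to call the new function:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_answer(question: str, model: str = "gpt-3.5-turbo") -> str:
    # Ask the chat completions API to answer the user's question.
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


@app.get("/qa")
def answer_question(question: str):
    # The hard-coded "42" is replaced with a model-generated answer.
    return {"question": question, "answer": generate_answer(question)}
```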
Test the endpoint again and verify that the answer to the ultimate question of life got slightly longer.
We now have a new way to answer questions, but how can we verify that our users prefer this option? Of course, we can run an experiment: we randomize users into different variations and then analyze the outcomes on key metrics we care about. Let's integrate the Eppo feature flagging SDK to make this trivial.
First, create a feature flag in the Eppo UI with three variants: the hard-coded answer (control), GPT-3.5, and GPT-4. Then, create an allocation, which determines how the flag gets triggered. For an experiment, we might want to randomize users into each of the three variants with equal probability.
Finally, turn on the flag in the testing environment so that we can evaluate it in our app. At this point, the flag, its three variants, and the allocation should all be configured in the Eppo UI.
Back in our app, start by adding the Eppo client to our imports:
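Assuming the eppo-server-sdk package (module paths can differ slightly between SDK versions):

```python
import eppo_client
from eppo_client.config import Config
from eppo_client.assignment_logger import AssignmentLogger
```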
Next, we set up our logger and initialize the SDK client, which fetches the allocation configuration we just set up in the Eppo UI:
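A sketch, with the SDK key left as a placeholder and the exact configuration class and method names depending on your SDK version:

```python
class PrintingAssignmentLogger(AssignmentLogger):
    # Placeholder logger: in production this would write assignments
    # to a table in your data warehouse.
    def log_assignment(self, assignment):
        print("Eppo assignment:", assignment)


eppo_client.init(
    Config(
        api_key="<YOUR-EPPO-SDK-KEY>",
        assignment_logger=PrintingAssignmentLogger(),
    )
)
```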
In practice, you want to set up the logger to write the assignments to an assignment table in your data warehouse so that you can analyze the outcomes of the experiment by joining these assignments to the metrics you care about. But for now, printing the assignment is a useful placeholder to make sure everything works locally.
Next, let's tweak our QA endpoint to fetch the variant. We make three simple changes:
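A sketch of those changes, with the parameter name, flag key, and assignment method assumed (the method name varies by SDK version): the endpoint now accepts a user_id, asks the Eppo client for that user's variant, and returns the variant in the response.

```python
@app.get("/qa")
def answer_question(question: str, user_id: str):
    # 1. Accept a user_id so each user can be assigned a variant.
    # 2. Look up the user's variant ("llm-experiment" is an assumed flag key;
    #    older SDK versions expose get_assignment instead).
    variant = eppo_client.get_instance().get_string_assignment(
        "llm-experiment",  # flag key
        user_id,           # subject key
        {},                # subject attributes
        "control",         # default variant if the flag is not active
    )
    # 3. Return the variant alongside the answer; the completion code is unchanged.
    return {
        "question": question,
        "variant": variant,
        "answer": generate_answer(question),
    }
```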
Note that we have not updated the completion code yet; first, let's focus on making sure we integrated the SDK successfully. Test the endpoint either through the interactive docs or by browsing directly to these URLs:
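With the sketch above, those URLs would look something like this (host, port, and parameter names are assumptions):

```
http://127.0.0.1:8000/qa?question=What%20is%20the%20answer%20to%20life%3F&user_id=alice
http://127.0.0.1:8000/qa?question=What%20is%20the%20answer%20to%20life%3F&user_id=bob
http://127.0.0.1:8000/qa?question=What%20is%20the%20answer%20to%20life%3F&user_id=charlie
```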
Make sure to try the same user a couple of times to verify that the exact same variant indeed shows up every time. In my case, Alice and Bob consistently see the GPT-4 variant, while Charlie sees GPT-3.5, though your assignments may differ. Furthermore, you should see the assignments printed to your terminal as well.
Using either the URLs above or the interactive documentation, we can now verify that the variant assigned to different users indeed leads to the API returning different responses.
At this point, you’ve configured the assignment and flagging needed to run an online A/B test between LLMs. We also talked about why this is a critical competency to have: “AB Experiment Infra is AI Infra”! Depending on your level of context before reading this demonstration, you may still be missing some critical components of running an experiment, such as how to decide which metrics to measure or how to analyze your results. Here are some recommended resources from the Eppo blog and docs to help you take your next step: