Why experiments are necessary to evaluate LLMs - and how you can easily A/B test between various models with Eppo.
Sven Schmit
September 12, 2023
GPT-4, Llama 2, Claude, Orca… whether you’re looking to answer customer support tickets, generate content, sort through resumes, or query a large volume of data, the scope of business applications is growing steadily as LLMs advance in capabilities. Of course, choosing between the many LLM options, both commercial and open-source, is a problem requiring plenty of forethought — and experimentation. There are even differences between versions of a single model depending on use case — GPT-4 has a world of new capabilities vs. its predecessor GPT-3.5, but its slower speed and higher cost may make it a less optimal fit for certain use cases.
The need to A/B test different models, then, is obvious. If we care about the success and impact of a chosen model on business outcomes (as measured by defined metrics), an A/B test is our gold-standard tool for isolating the causal impact of different LLMs from other external factors. Offline evaluation all too often fails to correlate with our actual target business metrics (and can be prone to overfitting), so running online experiments in production becomes a necessity. (You can read more on why machine learning models cannot be evaluated without A/B experiments here.)
In this demonstration, I’ll provide a simple end-to-end example of how to run an experiment comparing LLMs using Eppo. Let’s imagine we want to compare the success of GPT-3.5 and GPT-4 for our hypothetical use case.
We will be using:
FastAPI to create a simple webserver
Eppo's Python SDK to fetch feature flag configuration
OpenAI API to answer questions
If you want to see the end result of the experiment being configured, I’ve also added it to Eppo’s preview app so you can click around and explore for yourself. (First get access to the preview app on the Eppo homepage, and then you can navigate directly to the tutorial example here)
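The initial webserver setup isn't reproduced in this excerpt, but as a rough sketch (the file location and handler name are illustrative), a minimal src/app.py with a single /qa endpoint that always answers 42 might look something like this:

from fastapi import FastAPI

app = FastAPI()


@app.get("/qa")
def answer_question(question: str):
    # For now, every question gets the same fixed answer
    return {"question": question, "answer": "42"}

Run it locally with uvicorn (for example, uvicorn app:app --reload from the src directory).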
Now use the interactive docs or navigate to http://127.0.0.1:8000/qa?question=What%20is%20the%20answer%20to%20the%20ultimate%20question%20of%20life%3F to verify everything works as expected.
Integrating the OpenAI API
Let's make things a bit more interesting by integrating with the OpenAI API to help us answer questions. Make sure you have signed up for the OpenAI API and have an OpenAI API key.
Let's store the API key locally and make sure it does not get checked into GitHub by accident. Copy the .env.dist file to .env from the top-level directory like so:
cp src/.env.dist src/.env
Now open src/.env and add your OpenAI API key.
Next, we install the openai and python-dotenv libraries (the latter is used to read the API key you just stored):
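pip install openai python-dotenv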
Add the following to the top of your app.py file:
import os
import openai
from dotenv import load_dotenv
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
print(openai.api_key[:6])  # optional: print a key prefix to confirm the key loaded successfully
Now for the fun part: we create a function that uses OpenAI's chat completion endpoint to answer our question by adding the following code.
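The original snippet isn't reproduced here, but a sketch of what it might look like, assuming the function is named openai_chat_completion (as referenced later in this post), a default model of gpt-3.5-turbo, a simple system prompt, and the pre-1.0 openai library interface that matches the openai.api_key setup above:

def openai_chat_completion(question: str, model: str = "gpt-3.5-turbo") -> str:
    # Send the question to the chat completion endpoint and return the model's answer
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


@app.get("/qa")
def answer_question(question: str):
    # The endpoint now delegates to the LLM instead of always answering 42
    return {"question": question, "answer": openai_chat_completion(question)}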
Test the endpoint again and verify that the answer to the ultimate question of life got slightly longer.
Creating an experiment
We now have a new way to answer questions, but how can we verify that our users prefer this option? Of course, we can run an experiment: we randomize users into different variations and then analyze the outcomes on key metrics we care about. Let's integrate the Eppo feature flagging SDK to make this trivial.
Creating an API key
First, create an Eppo API key in the test environment. The name is unimportant (consider calling it AI testing key), but make sure you copy the actual key into your src/.env file. For the correct format, compare your .env file with the .env.dist file.
Setting up a feature flag
With the API key set up, we can now create a feature flag and evaluate it in our app. Create a new feature flag called AI demo model version and use the automatically generated key ai-demo-model-version. Let's create three variations:
GPT3.5
GPT4
Fixed (for always responding with the answer 42)
Then, create an allocation, which determines how the flag gets triggered. For an experiment, we want to randomize users into each of the three variants with equal probability.
Finally, turn on the flag in the test environment so that we can evaluate it in our app. Your setup should now show the flag enabled with the three variations and the experiment allocation in place.
Adding the feature flag to the app
First up, install the Eppo Python SDK using pip install eppo-server-sdk. We should now be able to import the SDK into our script by adding
import eppo_client
from eppo_client.config import Config
from eppo_client.assignment_logger import AssignmentLogger
to our imports.
Next, we set up our logger and initialize the SDK client, which fetches the allocation configuration we just set up in the Eppo UI.
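A sketch of that initialization, assuming the SDK key is stored under an EPPO_API_KEY entry in src/.env (the variable name here is illustrative; match whatever your .env.dist uses) and the Config / init / log_assignment interface of the eppo-server-sdk version used at the time (check the SDK docs for your version):

class PrintAssignmentLogger(AssignmentLogger):
    # Placeholder logger: in production this should write to your data warehouse
    def log_assignment(self, assignment):
        print(assignment)


eppo_client.init(
    Config(
        api_key=os.getenv("EPPO_API_KEY"),
        assignment_logger=PrintAssignmentLogger(),
    )
)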
In practice, you want the logger to write assignments to an assignment table in your data warehouse so that you can analyze the outcomes of the experiment by joining these assignments to the metrics you care about. But for now, printing the assignment is a useful placeholder to make sure everything works locally.
Next, let's tweak our QA endpoint to fetch the variant. We make three simple changes (a sketch of the updated endpoint follows this list):
Add a user argument to the endpoint: randomization happens based on a key (usually a user_id) so that we can ensure the same user sees a consistent experience throughout the experiment. Thus, the randomizer needs to know what user hits the endpoint.
We fetch the variant from the randomizer
We add the variant to the response, so that we can conveniently inspect the results
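Put together, the updated endpoint might look like the sketch below. The handler name is illustrative, and the assignment method differs across SDK versions (older releases expose get_assignment(subject, flag), newer ones get_string_assignment), so check the docs for the version you installed:

@app.get("/qa")
def answer_question(question: str, user: str):
    client = eppo_client.get_instance()
    # Fetch this user's variant for the ai-demo-model-version flag
    variant = client.get_assignment(user, "ai-demo-model-version")
    return {
        "user": user,
        "variant": variant,
        "question": question,
        "answer": openai_chat_completion(question),
    }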
Note that we have not updated the completion code yet; first, let's focus on making sure we integrated the SDK successfully. Test the endpoint either through the interactive docs or by browsing directly to /qa with a question and a few different values for user (for example, user=alice, user=bob, user=charlie).
Make sure to try the same user a couple of times to verify that the exact same variant shows up every time. In my case, Alice and Bob consistently see the GPT4 variant, while Charlie sees GPT3.5, but your results may differ. Furthermore, you should see the assignments logged to your terminal as well.
We saved the best for last: using our variant to dynamically pick the model version. This is now a simple change to our qa endpoint.
When "gpt" is in the variant name, we can supply the model variant directly to the openai_chat_completion function. Otherwise, simply return "42". Note that it is good practice to code defensively here. While the Eppo SDK is extremely robust, by checking whether the variant actually exists we make sure the code runs correctly even when there is an issue fetching the assignment from our SDK.
Using either the URLs above or the interactive documentation, we can now verify that the variant assigned to different users indeed leads to the API returning different answers.
Conclusion
At this point, you’ve configured the necessary assignment and flagging to run an online A/B test between LLMs. We also talked about why this is a critical competency to have - “AB Experiment Infra is AI Infra”! Depending on your level of context before reading this demonstration, you may still be missing some critical components of running an experiment - like how to determine which metrics to measure, or how to analyze your results. Here are some recommended resources from the Eppo blog and docs to help you take your next step: