Why experiments are necessary to evaluate LLMs - and how you can easily A/B test between various models with Eppo.
Sven Schmit
September 12, 2023
GPT-4, Llama 2, Claude, Orca… whether you’re looking to answer customer support tickets, generate content, sort through resumes, or query a large volume of data, the scope of business applications is growing steadily as LLMs advance in capabilities. Of course, choosing between the many LLM options, both commercial and open-source, is a problem requiring plenty of forethought — and experimentation. There are even differences between versions of a single model depending on use case — GPT-4 has a world of new capabilities vs. its predecessor GPT-3.5, but its slower speed and higher cost may make it a less optimal fit for certain use cases.
The need to A/B test different models, then, is obvious. If we care about the success and impact of a chosen model on business outcomes (as measured by defined metrics), an A/B test is our gold-standard tool for isolating the causal impact of different LLMs from other external factors. Offline evaluation all too often fails to correlate with our actual target business metrics (and can be prone to overfitting), so running online experiments in production becomes a necessity. (You can read more on why machine learning models cannot be evaluated without A/B experiments here.)
In this demonstration, I’ll provide a simple end-to-end example of how to run an experiment comparing LLMs using Eppo. Let’s imagine we want to compare the success of GPT-3.5 and GPT-4 for our hypothetical use case.
We will be using:
FastAPI to create a simple webserver
Eppo's Python SDK to fetch feature flag configuration
OpenAI API to answer questions
If you want to see the end result of the experiment being configured, I’ve also added it to Eppo’s preview app so you can click around and explore for yourself. (First get access to the preview app on the Eppo homepage, and then you can navigate directly to the tutorial example here)
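The initial webserver setup isn't reproduced in this excerpt, but as a rough sketch (the file location and handler name are illustrative), a minimal src/app.py with a single /qa endpoint that always answers 42 might look something like this:

from fastapi import FastAPI

app = FastAPI()


@app.get("/qa")
def answer_question(question: str):
    # For now, every question gets the same fixed answer
    return {"question": question, "answer": "42"}

Run it locally with uvicorn (for example, uvicorn app:app --reload from the src directory).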
Now use the interactive docs or navigate to http://127.0.0.1:8000/qa?question=What%20is%20the%20answer%20to%20the%20ultimate%20question%20of%20life%3F to verify everything works as expected.
Integrating the OpenAI API
Let's make things a bit more interesting by integrating with the OpenAI API to help us answer questions. Make sure you have signed up for the OpenAI API and have an OpenAI API key.
Let's store the API key locally and make sure it does not get checked into GitHub by accident. Copy the .env.dist file to .env from the top-level directory like so:
cp src/.env.dist src/.env
Now open src/.env and add your OpenAI API key.
Next, we install the openai and python-dotenv libraries (the latter is used to read the API key you just stored):
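pip install openai python-dotenv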
Add the following to the top of your app.py file:
import os
import openai
from dotenv import load_dotenv
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
print(openai.api_key[:6])  # optional: print a key prefix to confirm the key loaded successfully
Now for the fun part: we create a function that uses OpenAI's chat completion endpoint to answer our question by adding the following code.
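The original snippet isn't reproduced here, but a sketch of what it might look like, assuming the function is named openai_chat_completion (as referenced later in this post), a default model of gpt-3.5-turbo, a simple system prompt, and the pre-1.0 openai library interface that matches the openai.api_key setup above:

def openai_chat_completion(question: str, model: str = "gpt-3.5-turbo") -> str:
    # Send the question to the chat completion endpoint and return the model's answer
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


@app.get("/qa")
def answer_question(question: str):
    # The endpoint now delegates to the LLM instead of always answering 42
    return {"question": question, "answer": openai_chat_completion(question)}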
Test the endpoint again and verify that the answer to the ultimate question of life got slightly longer.
Creating an experiment
We now have a new way to answer questions, but how can we verify that our users prefer this option? Of course, we can run an experiment: we randomize users into different variations and then analyze the outcomes on key metrics we care about. Let's integrate the Eppo feature flagging SDK to make this trivial.
Creating an API key
First, create an Eppo API key in the test environment. The name is unimportant (consider calling it AI testing key), but make sure you copy the actual key into your src/.env file. For the correct format, compare your .env file with the .env.dist file.
Setting up a feature flag
With the API key set up, we can now create a feature flag and evaluate it in our app. Create a new feature flag called AI demo model version and use the automatically generated key ai-demo-model-version. Let's create three variations:
GPT3.5
GPT4
Fixed (for always responding with the answer 42)
Then, create an allocation, which determines how the flag gets triggered. For an experiment, we want to randomize users into each of the three variants with equal probability.
Finally, turn on the flag in the test environment so that we can evaluate it in our app. Your setup should now show the flag enabled with the three variations and the experiment allocation in place.
Adding the feature flag to the app
First up, install the Eppo Python SDK using pip install eppo-server-sdk. We should now be able to import the SDK into our script by adding
import eppo_client
from eppo_client.config import Config
from eppo_client.assignment_logger import AssignmentLogger
to our imports.
Next, we set up our logger and initialize the SDK client, which fetches the allocation configuration we just set up in the Eppo UI.
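A sketch of that initialization, assuming the SDK key is stored under an EPPO_API_KEY entry in src/.env (the variable name here is illustrative; match whatever your .env.dist uses) and the Config / init / log_assignment interface of the eppo-server-sdk version used at the time (check the SDK docs for your version):

class PrintAssignmentLogger(AssignmentLogger):
    # Placeholder logger: in production this should write to your data warehouse
    def log_assignment(self, assignment):
        print(assignment)


eppo_client.init(
    Config(
        api_key=os.getenv("EPPO_API_KEY"),
        assignment_logger=PrintAssignmentLogger(),
    )
)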
In practice, you want the logger to write assignments to an assignment table in your data warehouse so that you can analyze the outcomes of the experiment by joining these assignments to the metrics you care about. But for now, printing the assignment is a useful placeholder to make sure everything works locally.
Next, let's tweak our QA endpoint to fetch the variant. We make three simple changes (a sketch of the updated endpoint follows this list):
Add a user argument to the endpoint: randomization happens based on a key (usually a user_id) so that we can ensure the same user sees a consistent experience throughout the experiment. Thus, the randomizer needs to know what user hits the endpoint.
We fetch the variant from the randomizer
We add the variant to the response, so that we can conveniently inspect the results
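Put together, the updated endpoint might look like the sketch below. The handler name is illustrative, and the assignment method differs across SDK versions (older releases expose get_assignment(subject, flag), newer ones get_string_assignment), so check the docs for the version you installed:

@app.get("/qa")
def answer_question(question: str, user: str):
    client = eppo_client.get_instance()
    # Fetch this user's variant for the ai-demo-model-version flag
    variant = client.get_assignment(user, "ai-demo-model-version")
    return {
        "user": user,
        "variant": variant,
        "question": question,
        "answer": openai_chat_completion(question),
    }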
Note that we have not updated the completion code yet; first, let's focus on making sure we integrated the SDK successfully. Test the endpoint either through the interactive docs or by browsing directly to /qa with a question and a few different values for user (for example, user=alice, user=bob, user=charlie).
Make sure to try the same user a couple of times to verify that the exact same variant shows up every time. In my case, Alice and Bob consistently see the GPT4 variant, while Charlie sees GPT3.5, but your results may differ. Furthermore, you should see the assignments logged to your terminal as well.
We saved the best for last: using our variant to dynamically pick the model version. This is now a simple change to our qa endpoint.
When "gpt" is in the variant name, we can supply the model variant directly to the openai_chat_completion function. Otherwise, simply return "42". Note that it is good practice to code defensively here. While the Eppo SDK is extremely robust, by checking whether the variant actually exists we make sure the code runs correctly even when there is an issue fetching the assignment from our SDK.
Using either the URLs above or the interactive documentation, we can now verify that the variant assigned to different users indeed leads to the API returning different answers.
Conclusion
At this point, you’ve configured the necessary assignment and flagging to run an online A/B test between LLMs. We also talked about why this is a critical competency to have - “AB Experiment Infra is AI Infra”! Depending on your level of context before reading this demonstration, you may still be missing some critical components of running an experiment - like how to determine which metrics to measure, or how to analyze your results. Here are some recommended resources from the Eppo blog and docs to help you take your next step: