Comparing Models
Oxen.ai helps you compare results from your machine learning models.
There’s always a reason to do a model comparison - whether a new model or finetune drops, a new prompting methodology comes out, or it’s time to evaluate an in-house model. Oxen’s diff tool allows you to evaluate and compare model outputs from small tests all the way to large benchmarks.
To follow along with this example, we’ll be using data from the BoolQ Repo, which was generated with this notebook.
Our Data
In this repo, we’re comparing the outputs of the Gemma-2b-Instruct model and the Llama-7b-chat-hf model on the BoolQ benchmark.
Let’s check out the structure of these datasets:
Each dataframe has columns for the index, the context for the question, the prompt, and the ground truth label (`validation_response`).
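If you want to poke at these files locally, a minimal pandas sketch for loading and inspecting them might look like this (the filenames here are hypothetical; substitute the actual result files from the repo):

```python
import pandas as pd

# Hypothetical filenames -- substitute the actual result files from the BoolQ repo.
gemma = pd.read_csv("gemma_2b_results.csv")
llama = pd.read_csv("llama_7b_results.csv")

# Both frames share the same schema.
print(gemma.columns.tolist())
# e.g. ['index', 'context', 'prompt', 'validation_response', 'processed_response']
```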
We can see here that our models didn’t output exactly “True” or “False” like they were told to. So we added a `processed_response` column to show a clean difference between the outputs.
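As a rough illustration of how such a column could be produced (this is a sketch, not the exact logic from the notebook, and `raw_response` is a hypothetical name for the unprocessed output column):

```python
def process_response(raw: str) -> str:
    """Normalize a raw model output to 'True', 'False', or 'Ambiguous'.

    A rough sketch -- the actual notebook may use different rules.
    """
    text = raw.strip().lower()
    has_true = "true" in text
    has_false = "false" in text
    if has_true and not has_false:
        return "True"
    if has_false and not has_true:
        return "False"
    return "Ambiguous"

# "raw_response" is a hypothetical column name for the unprocessed model output.
gemma["processed_response"] = gemma["raw_response"].apply(process_response)
llama["processed_response"] = llama["raw_response"].apply(process_response)
```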
Comparing Model Results
But we mainly care about how these models perform relative to each other. So what we really want to know is where the `processed_response` values differ between the two files.
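Outside of Oxen’s built-in diff, you can approximate this comparison with a pandas merge on the shared index (a sketch that assumes the schema shown above):

```python
# Join the two result sets on the shared index column; overlapping
# columns get suffixed so we can compare them side by side.
merged = gemma.merge(llama, on="index", suffixes=("_gemma", "_llama"))

# Rows where the two models' cleaned answers differ.
disagreements = merged[
    merged["processed_response_gemma"] != merged["processed_response_llama"]
]

print(f"{len(disagreements)} / {len(merged)} samples "
      f"({len(disagreements) / len(merged):.1%}) disagree")
```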
View Results in Oxen UI
These results are also available in the Oxen UI, which makes it a bit easier to grok what’s going on than the command line.
You can view the results in the UI by going to the compare tab in this repository.
From this, we can see that out of the 3270 total samples, our models disagreed on 360 total samples, or roughly 11% of the dataset.
In some cases, like line 25, the model on the right (`llama_chat` in this case) didn’t really provide an answer, as it responded with both “True and False”.
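With the merge sketch above, those ambiguous rows can be pulled out programmatically (assuming the normalization step maps outputs like “True and False” to an 'Ambiguous' label):

```python
# Rows where llama's cleaned answer is neither a clear True nor a clear False.
ambiguous = merged[merged["processed_response_llama"] == "Ambiguous"]
print(ambiguous[["index", "processed_response_gemma", "processed_response_llama"]])
```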
Takeaways
Some potential takeaways are:
- Gemma-2b was better at following these instructions (text formatting) than Llama-7b despite its smaller size.
- These models were largely in agreement on the validation set without any finetuning on the training set.
- Gemma-2b is a candidate to replace Llama-7b-chat as a base model for this task, though further exploration is needed to confirm this.
Next Steps
We will use the `oxen diff` tool to dive deeper into these results and compare accuracies. We will also explore the trends in these differences and how to use Oxen to take the next steps in our data science workflow.
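As a preview of that accuracy comparison, per-model accuracy against the ground-truth labels can be computed directly from the merged frame (again a sketch under the schema assumed above):

```python
# Both files carry the same ground-truth labels, so either suffixed copy works.
truth = merged["validation_response_gemma"].astype(str)

gemma_acc = (merged["processed_response_gemma"].astype(str) == truth).mean()
llama_acc = (merged["processed_response_llama"].astype(str) == truth).mean()

print(f"Gemma-2b accuracy: {gemma_acc:.1%}")
print(f"Llama-7b accuracy: {llama_acc:.1%}")
```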