Here we will show you how to quickly fine-tune Llama 3.2 (3B) on a general education dataset of 3,000 examples.
We'll use the Model Evals tool again. Go through the same process of opening the dataset and clicking the “Model Inference” button. This time, choose a different model, write a prompt explaining that it is judging the quality of the responses, and pass in the prompt and response columns. We’re going to use GPT-4o mini with the prompt:
Taking the time to specify exactly what you are looking for matters. Telling the model the precise criteria for a good or bad response gives you more accurate evaluations and more control over the judging. It’s also best practice to use a judge model from a different provider than the base model, since LLMs have been found to prefer their own responses even when those responses aren’t the best.
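If you want to reproduce the same LLM-as-judge step outside the Model Evals UI, a minimal sketch looks something like the following. It assumes the `openai` Python package, illustrative `prompt`/`response` column names, and example judging criteria; it is not the exact prompt or pipeline used above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example criteria only; spell out whatever "good" and "bad" mean for your dataset.
JUDGE_PROMPT = """You are grading the quality of a model's answer to a general education question.
Score the response from 1 (poor) to 5 (excellent) based on:
- Factual accuracy
- Completeness of the explanation
- Clarity for a student audience
Reply with the score only."""

def judge(prompt: str, response: str) -> str:
    """Ask GPT-4o mini to rate a single prompt/response pair."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic scoring
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question:\n{prompt}\n\nResponse:\n{response}"},
        ],
    )
    return result.choices[0].message.content.strip()

# Usage: score each row's prompt and response columns.
rows = [{"prompt": "What causes the seasons?", "response": "The tilt of Earth's axis..."}]
scores = [judge(r["prompt"], r["response"]) for r in rows]
```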