It can be expensive and time-consuming to collect or label data. Synthetic data generation uses a strong model to generate that data for you, skipping much of the manual labor.
Say you want to predict the total amount on an invoice but don't have any customer data yet. You could use an LLM to generate a dataset of 1,000 invoices with random companies, products, dates, and amounts. This dataset can serve as a test set to validate different models, or be split into train/test sets to train a new model.
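As a minimal sketch of the invoice example (the prompt wording and the `parse_invoices` helper are our own, not from a specific library), you would prompt the model for structured JSON and validate each row locally before adding it to the dataset:

```python
import json

# Prompt template asking the model for structured invoice data.
# Send this to whichever chat-completion endpoint you use; the
# fields listed here are just the ones from the example above.
PROMPT = (
    "Generate {n} fictional invoices as a JSON list. Each invoice needs "
    "'company', 'product', 'date' (YYYY-MM-DD), and 'total' (a number)."
)

def parse_invoices(raw: str) -> list[dict]:
    """Validate the model's JSON output before keeping it."""
    invoices = json.loads(raw)
    required = {"company", "product", "date", "total"}
    for inv in invoices:
        missing = required - inv.keys()
        if missing:
            raise ValueError(f"invoice missing fields: {missing}")
    return invoices

# Validating a hypothetical model response:
sample = '[{"company": "Acme Co", "product": "Widget", "date": "2024-05-01", "total": 249.99}]'
invoices = parse_invoices(sample)
print(len(invoices))  # 1
```

Validating as you go matters for synthetic data: a small fraction of malformed generations is normal, and it is cheaper to drop or regenerate a row than to debug a corrupted training set later.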
Another example: use a strong vision-language model such as Qwen2 VL 72B Instruct to caption a set of images in as much detail as possible, then use the synthetic captioned image data to train another model to generate images.
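For the captioning step, most hosted vision-language models accept OpenAI-style multimodal messages with the image passed inline as a base64 data URL. A sketch of building that message (the helper name and prompt are ours; check your provider's docs for the exact payload it expects):

```python
import base64

def image_message(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style multimodal chat message pairing a
    captioning prompt with a base64-encoded image (data-URL form)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# Stand-in bytes here; in practice read each image file from disk.
msg = image_message(b"\xff\xd8fake-jpeg-bytes",
                    "Describe this image in as much detail as possible.")
print(msg["content"][1]["image_url"]["url"][:22])  # data:image/jpeg;base64
```

Loop this over your image set and write each `(image, caption)` pair out (e.g. as JSONL) to form the synthetic caption dataset.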
In this tutorial, we will show how we created a dataset of different roles in a company, a product each role would use, and a question and response about that product. This kind of dataset can be used for customer support, a chatbot, and similar applications. We used Llama 3.1 70B and Hermes 3 70B to compare the quality of the synthetic data. Check out the Models Page to try different models.
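The role → product → question → response chain can be sketched as below. Here `generate` stands in for a call to whichever LLM you choose (the prompt wording and the canned demo responses are our own illustration, not the tutorial's exact prompts):

```python
def build_pipeline(generate):
    """Chain four prompts: role -> product -> question -> answer.
    `generate` is any callable that sends a prompt string to an LLM
    and returns its text response."""
    def make_row():
        role = generate("Invent a realistic job role at a mid-size company.")
        product = generate(f"Name a software product a '{role}' would use daily.")
        question = generate(f"As a '{role}', ask a support question about '{product}'.")
        answer = generate(f"As a support agent for '{product}', answer: {question}")
        return {"role": role, "product": product,
                "question": question, "answer": answer}
    return make_row

# Demo with canned responses instead of a real model call:
canned = iter(["Payroll Manager", "PayFlow", "How do I rerun a payroll?",
               "Open Pay > Run payroll and select the period to rerun."])
row = build_pipeline(lambda prompt: next(canned))()
print(row["role"])  # Payroll Manager
```

Feeding each step's output into the next prompt is what keeps the rows internally consistent: the product fits the role, and the Q&A pair fits the product.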
Synthetic-Persona-Customer-Support.
We include {uuid} in the prompt to give a little bit of randomness to the output.
While the evaluation is running, you will see a progress bar showing how many rows have been completed, a running count of tokens used, and the cost of the run so far.
We also generated synthetic data with Hermes 3 70B to compare the quality; you can see the results here. We found that Llama 3.1 70B was better overall. Hermes 3 70B was very chatty, sometimes returned nothing at all, and, perhaps because of the length of its outputs, took far longer to complete (over 2 hours 45 minutes for response generation versus 28 minutes for Llama 3.1 70B).