🤖 Synthetic Data Generation
Collecting or labeling data can be expensive and time-consuming. Synthetic data generation is the process of using a strong model to generate data for you, skipping a lot of the manual labor.
Synthetic data can augment your existing data, help filter down a data distribution, or form a completely new dataset. Be careful: generated data is not always accurate, but it can give you a good starting point. You should always validate the data you generate.
Examples
Synthetic data can be used for a variety of tasks. Here are a few examples:
Synthetic Invoice Data 🧾
Say you want to predict the total amount on an invoice but don't have any customer data yet. You could use an LLM to generate a dataset of 1,000 invoices with random companies, products, dates, and amounts. This can serve as a test set for validating different models, or be split into train/test sets to train a new model.
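As a rough sketch, here is how you might script that generation against an OpenAI-compatible chat API. The model name and the invoice JSON schema below are illustrative assumptions, not something prescribed by this tutorial:

```python
import json
import random

from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # reads OPENAI_API_KEY; set base_url for other providers

PROMPT = (
    "Generate one realistic but entirely fictional invoice as JSON with the "
    "keys: company, product, date (YYYY-MM-DD), quantity, unit_price, total. "
    "Respond with JSON only. Seed: {seed}"
)

invoices = []
for _ in range(1000):
    # A random seed in the prompt nudges the model away from repeating itself
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use any capable model
        messages=[{"role": "user", "content": PROMPT.format(seed=random.random())}],
        response_format={"type": "json_object"},  # if the endpoint supports JSON mode
    )
    invoices.append(json.loads(resp.choices[0].message.content))

# One invoice per line, so the file can later be split into train/test sets
with open("synthetic_invoices.jsonl", "w") as f:
    for invoice in invoices:
        f.write(json.dumps(invoice) + "\n")
```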
Image Captioning + Generation 🖼️
Use a strong model such as Qwen2 VL 72B Instruct to caption a set of images in as much detail as possible. You can then use the synthetic image-caption pairs to train another model to generate images.
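A minimal captioning sketch, assuming the model is served behind an OpenAI-compatible vision endpoint (e.g. via vLLM); the base_url and model name are placeholders:

```python
from openai import OpenAI

# Assumes Qwen2 VL 72B Instruct is served behind an OpenAI-compatible
# endpoint; the base_url and model name below are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def caption(image_url: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in as much detail as possible."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(caption("https://example.com/cat.jpg"))
```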
Synthetic Persona Customer Support 🤖
In this tutorial, we will show you how we created a dataset of different roles in a company, a product each role would use, and a question and response about that product. This can be used for customer support, a chatbot, etc.
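To make the shape of the final data concrete, a finished row looks roughly like this (the values are made up for illustration):

```python
# One finished row of the dataset (values are illustrative, not from the
# actual dataset): we start with uuid + role, then generate the rest.
row = {
    "uuid": "3f2b9c1e-8d4a-4f6b-9a2e-1c7d5e8f0a3b",
    "role": "Data Scientist",
    "product": "Jupyter Notebook",  # generated from the role
    "question": "Why does my kernel keep dying when I load a large CSV?",
    "response": "Kernels usually die when they run out of memory. Try "
                "reading the file in chunks with pandas' chunksize option.",
}
```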
We used Llama 3.1 70B and Hermes 3 70B to compare the quality of the synthetic data. Check out the Models Page to try different models.
Get the Data
First, download the Synthetic Persona Customer Support dataset as a starting point. This dataset is just a set of random UUIDs and company roles that were generated by ChatGPT.
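If you prefer to script this step, the Oxen Python client can download files directly. The repo path and method names below are assumptions based on the client docs, so copy the real path from the dataset page and double-check the API:

```python
# pip install oxenai
from oxen import RemoteRepo

# The repo path and file name are assumptions; grab the real ones from
# the dataset page linked above.
repo = RemoteRepo("ox/Synthetic-Persona-Customer-Support")
repo.download("starting_point.jsonl")
```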
Create a Repository
Once you have the starting point data, create a new repository and name it Synthetic-Persona-Customer-Support.
Then upload the starting_point.jsonl file with a commit message.
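You can also do this from Python. The sketch below assumes the oxenai client's RemoteRepo API (create/add/commit); treat the method names as assumptions and check the Python client docs for the exact signatures:

```python
# pip install oxenai
from oxen import RemoteRepo
from oxen.auth import config_auth

# Authenticate with your Oxen.ai API token (assumed helper; see the
# Python client docs for the exact auth setup).
config_auth("YOUR_API_TOKEN")

repo = RemoteRepo("your-username/Synthetic-Persona-Customer-Support")
repo.create()                               # same as clicking "New Repository"
repo.add("starting_point.jsonl")            # stage the starting data
repo.commit("Add starting point personas")  # the commit message
```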
Run a Model
Open the file you just uploaded and click the glowing rocket 🚀 button at the top right of the screen.
This opens Oxen's model evaluation feature, where you can choose the evaluation type, pick a model, and name the output column.
We’re starting with Meta’s Llama 3.1 70B and generating a product for each role, passing the role and uuid into the prompt. Give the evaluation a name and run a quick sample on a few rows to check that the prompt works.
Note: We are using the {uuid} to give a little bit of randomness to the output.
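The exact prompt isn't reproduced here, but an illustrative template looks like this; {role} and {uuid} are filled in from each row's columns:

```python
# Illustrative prompt template (not the exact wording we used). The
# {role} and {uuid} placeholders are replaced with each row's values
# when the evaluation runs; the uuid adds entropy so the model doesn't
# pick the same product for every row with the same role.
PRODUCT_PROMPT = (
    "You are a {role} at a software company. Name one software product "
    "you rely on every day. Respond with only the product name. "
    "Session id: {uuid}"
)
```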
Select File Destination
After tweaking your prompt and getting a sample you like, click “Next” to choose the destination of the finished evaluation. Pick the target branch and target path, and decide whether to commit instantly or after reviewing the results. Once you’ve decided, click “Run Evaluation”.
Monitor Your Evaluation
Feel free to grab a coffee ☕️, close the tab, or do something else while the evaluation is running. Your trusty Oxen Herd will keep working in the background.
While the evaluation is running, you will see a progress bar showing how many rows have been completed, a running count of tokens used, and the cost of the run so far.
Prompt & Repeat
After generating the products, we generated a question and response about each one using the same process. Just click “View file at commit” to see the updated dataset.
Then repeat the whole process for the question and response generation with these prompts:
Generate Questions
Generate Responses
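Illustrative stand-ins for those two prompts (not the exact wording used above) might look like this; each builds on columns generated in the previous step:

```python
# Illustrative stand-ins for the two prompts; each step feeds on the
# columns generated before it.
QUESTION_PROMPT = (
    "You are a {role} who uses {product}. Ask one specific question you "
    "might realistically send to the {product} support team. "
    "Session id: {uuid}"
)

RESPONSE_PROMPT = (
    "You are a support agent for {product}. Write a concise, helpful "
    "reply to this customer question: {question}"
)
```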
You’ll then get a full dataset ready for fine-tuning, training, or anything you’d like.
Next Steps
Once done, you will see your new dataset committed to the branch you specified. If you don’t like the results, don’t worry: under the hood, every run is versioned, so you can always revert to or compare against a previous version.
We also generated some synthetic data with Hermes 3 70B to compare quality; you can see the results here. We found that Llama 3.1 70B was better overall. Hermes 3 70B was very chatty, sometimes returned nothing at all, and, perhaps because of the length of its outputs, took far longer to complete (over 2 hours 45 minutes for response generation versus 28 minutes for Llama 3.1 70B).
You can search through the results with Oxen’s Text2SQL search. In this case, we looked for all the rows where the product column was MATLAB to compare the quality of the questions and responses.
You can also edit the SQL query to deepen your search.
You can run other natural language queries too, such as “Give me all the rows with ‘Researcher’ in the role column” or “How many rows have ‘Jira’ in the product column?”
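For a sense of what the generated SQL looks like, here are rough equivalents of those queries; the table name and column names are assumptions about how your dataset is exposed:

```python
# Rough SQL equivalents of the natural language queries above. The table
# name ("df") and column names are assumptions about the dataset schema.
queries = {
    "rows where the product is MATLAB":
        "SELECT * FROM df WHERE product = 'MATLAB';",
    "rows with 'Researcher' in the role column":
        "SELECT * FROM df WHERE role LIKE '%Researcher%';",
    "count of rows with 'Jira' in the product column":
        "SELECT COUNT(*) FROM df WHERE product LIKE '%Jira%';",
}
```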
Congratulations! You’ve just seen how easy it is to generate synthetic data. Let us know how your experience was and if you have any improvement requests.