🧪 Generate Synthetic Datasets
Build and version a synthetic dataset to train a model on
Collecting or labeling data can be expensive and time-consuming. Synthetic data can augment your existing data, help filter down a data distribution, or give you a completely new dataset. Be careful: generated data is not always accurate, but it can give you a good jumping-off point. Always validate and version the data you generate so that you can track changes and roll back if the data is not what you expected.
Follow along with the example notebook by running it in your own Oxen.ai account.
Setting Up the Dataset
In this case, we will be constructing a synthetic dataset of customer support conversations. Let's assume you have no data to start with. As long as you know the types of problems your customers are having, you can generate a starting dataset of fake names, roles, problems, and experience levels.
We will be using the faker library to generate a starting dataset. We can then run this dataset through an LLM to generate prompts and responses, with the model playing both the customer and the support agent. Faker has a lot of built-in functionality, such as the ability to generate fake names, addresses, phone numbers, and more.
We will extend it using the DynamicProvider interface to create our own providers.
You can now call these new providers to generate data.
Each row of our dataset will now have a uuid, name, role, problem, descriptor, and experience.
Now we can generate a starting dataset of 100 rows.
Versioning the Dataset
Now that we have our starting dataset, we should version it, so that we can play around with different prompts and models. You can use the upload method to upload a file to your repository with a commit message.
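Before a file can be uploaded, the rows have to exist on disk. The sketch below assumes CSV as the format and uses a hypothetical file name, customer_support.csv, with two hand-written placeholder rows standing in for the generated data. The upload call itself is left out because its exact arguments are not shown in this guide.

```python
import csv
from pathlib import Path


def write_dataset(rows: list[dict], path: Path) -> Path:
    """Write a list of row dictionaries to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Placeholder rows: in practice these would be the generated rows.
rows = [
    {"uuid": "row-1", "name": "Ada Example", "role": "student",
     "problem": "battery drains quickly", "descriptor": "frustrated",
     "experience": "beginner"},
    {"uuid": "row-2", "name": "Sam Sample", "role": "teacher",
     "problem": "cannot connect to Wi-Fi", "descriptor": "polite",
     "experience": "expert"},
]

dataset_path = write_dataset(rows, Path("customer_support.csv"))
print(f"Wrote {len(rows)} rows to {dataset_path}")
```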
Once the data is uploaded, you can view and query the generated dataset from Oxen.ai's Dataset UI.
Running an LLM
If you want to try out different prompts and models without writing any code, you can use Oxen.ai's Model Inference feature. Click the button on the right of the screen to open the inference UI.
In the example above, we are using DeepSeek-v3 to generate synthetic customer questions about an iPhone with the following prompt:
By default, Oxen.ai samples 5 rows from the dataset, so that you can get a sense of how well the model is performing. You will also see an estimated price for how much the inference will cost over the entire dataset.
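To get an intuition for how a small sample can inform a full-dataset estimate, here is a rough sketch. It is not Oxen.ai's actual estimator, which is not described here; the helper names and the per-row costs are hypothetical.

```python
import random


def sample_rows(rows: list, k: int = 5, seed: int = 0) -> list:
    """Pick up to k rows at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def estimate_total_cost(sample_costs: list, total_rows: int) -> float:
    """Extrapolate the mean per-row cost of the sample to the whole dataset."""
    if not sample_costs:
        return 0.0
    return sum(sample_costs) / len(sample_costs) * total_rows


rows = [{"id": i} for i in range(100)]
preview = sample_rows(rows)

# Hypothetical per-row costs in USD, as if measured on the five preview rows.
sample_costs = [0.0004, 0.0005, 0.0003, 0.0006, 0.0002]
estimate = estimate_total_cost(sample_costs, len(rows))
print(f"{len(preview)} preview rows, estimated total cost: ${estimate:.4f}")
```

The estimate is only as good as the sample: if row lengths vary a lot, five rows may under- or overstate the true cost.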
Once you feel confident in the sample results, you can run the inference on the entire dataset by clicking the Next -> button. This will allow you to pick an output branch and file, and to write a commit message once the run is complete.
Sit back, grab a coffee ☕️ and Oxen.ai will run the inference in the background.
Once the inference is complete, you can view and share the results with your team.
You can run the same process again with a new prompt to generate all the responses to the synthetic questions, but we will leave this as an exercise for the reader. Happy generating!