How to train an LLM on your own data in a Marimo Notebook.
To load our data, we use the `load_dataset` function from the `oxen.datasets` library. This is a wrapper around the Hugging Face datasets library, and is an easy way to load datasets from the Oxen.ai hub. For fine-tuning to work well, it is a good idea to have at least ~1,000-10,000 unique examples in your dataset. If you can collect more, that's even better.
Don't have a dataset yet? Check out how to generate a synthetic dataset from a stronger model to bootstrap your own.
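As a minimal sketch, loading a dataset might look like the following. The repo name `ox/my-finetune-data` and the file path are hypothetical, and the exact `load_dataset` signature may differ slightly from this, so check the `oxen.datasets` docs for your version.

```python
from oxen.datasets import load_dataset

# Hypothetical repo and file path -- replace with your own.
# Because load_dataset wraps the Hugging Face datasets library,
# the returned object supports the usual map/filter/split operations.
dataset = load_dataset("ox/my-finetune-data", "train.jsonl")
print(dataset)
```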
Next, we load the base model and tokenizer with the `AutoModelForCausalLM` and `AutoTokenizer` classes from the `transformers` library.
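For example, assuming `Qwen/Qwen2.5-0.5B-Instruct` as the base model (any causal LM checkpoint on the Hugging Face hub works the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base model -- substitute whichever checkpoint you are fine-tuning.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # fall back to float32 if your GPU lacks bf16
    device_map="auto",           # place weights on GPU if one is available
)
```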
Before training, it is worth sanity-checking the setup by calling a `predict` function with a sample question.
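The `predict` helper is not spelled out here, but a minimal version might look like this, using the tokenizer's chat template and `model.generate`:

```python
def predict(prompt: str, max_new_tokens: int = 256) -> str:
    # Format the prompt with the model's chat template.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Greedy decoding is fine for a quick smoke test.
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)

    # Strip the prompt tokens and decode only the completion.
    completion = outputs[0][inputs.shape[-1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)

print(predict("What is the capital of France?"))
```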
Next, we define an `OxenExperiment` class that handles creating a new branch, saving the model, and logging the results.
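A rough sketch of such a class is below. The branch-naming scheme and the `create_checkout_branch` call are assumptions, so adapt them to the `RemoteRepo` API in your version of the library.

```python
import os
from oxen import RemoteRepo

class OxenExperiment:
    """Tracks one fine-tuning run on its own Oxen branch (sketch)."""

    def __init__(self, repo: RemoteRepo, model_name: str, output_dir: str):
        self.repo = repo
        self.model_name = model_name
        # Assumed naming scheme: one branch per experiment.
        self.branch_name = f"experiment/{model_name.split('/')[-1]}"
        self.output_dir = os.path.join(output_dir, self.branch_name)
        os.makedirs(self.output_dir, exist_ok=True)
        # Assumed API -- create the branch remotely and switch to it.
        self.repo.create_checkout_branch(self.branch_name)
```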
Branches are lightweight in Oxen.ai, and by default they are not downloaded to your local machine when you clone. This means you can easily store model weights and other large assets on parallel branches while keeping your `main` branch small and manageable.
Once training starts, you can browse the experiment branch's `models` directory to see the model weights and other assets.
Next, we define an `OxenTrainerCallback` that will be called during training to save the model weights and our metrics. It is a subclass of the `TrainerCallback` class from the `transformers` library, which can be passed into our training loop.
In our subclass of `TrainerCallback`, we implement the `on_save` and `on_log` methods. The `on_save` method is called whenever the model is saved to disk, and the `on_log` method is called at each logging step during training, reporting the loss and other useful metrics.
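A skeleton of the callback might look like this. This is a sketch: the `experiment` object is the hypothetical `OxenExperiment` from above, and the upload logic is left as comments.

```python
from transformers import TrainerCallback

class OxenTrainerCallback(TrainerCallback):
    """Saves checkpoints and metrics to an Oxen branch during training (sketch)."""

    def __init__(self, experiment):
        self.experiment = experiment  # assumed OxenExperiment from above

    def on_save(self, args, state, control, **kwargs):
        # Called whenever the Trainer writes a checkpoint to disk.
        # Here we would upload the checkpoint directory to the
        # experiment branch via the workspace.
        pass

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Called at each logging step with a dict like
        # {"loss": 1.23, "learning_rate": 2e-5, "epoch": 0.1}.
        # Here we would append the metrics to a remote DataFrame.
        pass
```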
The most important concepts here are the `Workspace` and `DataFrame` objects from the `oxenai` library. The `Workspace` is a wrapper around the branch we are currently on. It allows us to write data to the remote branch without committing the changes. Think of it like your local repo of unstaged changes, but for remote branches. To navigate to your workspaces, use the branch dropdown and then look at the active workspaces for a file.
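For example, opening a workspace on the experiment branch and pointing a `DataFrame` at a results file might look like this. The repo name, branch name, and file path are hypothetical, and the exact constructor arguments may differ, so check the `oxenai` docs.

```python
from oxen import RemoteRepo, Workspace, DataFrame

repo = RemoteRepo("my-user/my-repo")             # hypothetical repo
workspace = Workspace(repo, "experiment/run-1")  # workspace on the experiment branch

# A DataFrame wraps a tabular file on the branch and lets us
# append rows remotely without committing yet.
df = DataFrame(workspace, "results/metrics.jsonl")
df.insert_row({"step": 0, "loss": 2.31})
```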
We use the `Workspace` to write the temporary results, and then commit the changes to the branch after training is complete.
You can see this pattern in action in the `on_log` method.
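Filling in the `on_log` stub from the skeleton above, appending each logging step's metrics might look like this. It assumes `self.df` is the `DataFrame` created earlier, stored on the callback in `__init__`.

```python
def on_log(self, args, state, control, logs=None, **kwargs):
    if logs is None:
        return
    # Append one row per logging step; the row lands in the
    # workspace immediately but is only committed after training.
    self.df.insert_row({
        "step": state.global_step,
        "epoch": logs.get("epoch"),
        "loss": logs.get("loss"),
        "learning_rate": logs.get("learning_rate"),
    })
```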
To kick off a run, we instantiate the experiment with a `RemoteRepo`, a model name, and an output directory.
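For instance, with the hypothetical `OxenExperiment` sketched above:

```python
repo = RemoteRepo("my-user/my-finetune-repo")  # hypothetical repo
experiment = OxenExperiment(repo, model_name, output_dir="models")
```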
The `trl` library from Hugging Face makes it easy to train and fine-tune models. We can use its `SFTConfig` class to set up our training loop, which determines the batch size, learning rate, number of epochs, and other hyperparameters.
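Putting it together, a training loop with `SFTConfig` and `SFTTrainer` might look like this. The hyperparameter values are illustrative, not recommendations.

```python
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir=experiment.output_dir,  # assumed attribute from the sketch above
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,                  # how often on_log fires
    save_steps=100,                    # how often on_save fires
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,  # may need dataset["train"] depending on how it loaded
    callbacks=[OxenTrainerCallback(experiment)],
)
trainer.train()
```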