Oxen’s open source data version control system shines at workflows and data sizes where git or git-lfs fall short. The interface is inspired by git, so that it is easy to learn, but has a few core differences. Oxen is built from the ground up to handle large datasets with many files or large csvs, parquet files, or other large binary blobs like model weights.

Versioning 101

The first thing you need to know about Oxen.ai is that it has both remote and local workflows. Remote workflows allow you to add files directly to the remote without pulling any data locally. Say we wanted to add a file to a dataset like ImageNet with 1 Million Files, you do not want to wait to clone all the files locally just to add yours.

from oxen import RemoteRepo

# Connect your client
repo = RemoteRepo("my-username/my-repo")
# Upload the image
repo.add("images/image_1_000_001.png")
# Commit to the main branch
repo.commit("Adding the 1,000,001st image to the dataset")

This is just one example of how Oxen.ai enables a more developer friendly workflow for large datasets. There are also optimizations under the hood such as parallel file transfer, scalable merkle trees, and data deduplication to make Oxen go brrr (or mooo?).

Client and Server

The open source version control tools come a server to sync data to and a client that can interact with data locally and remotely. The client and server share a common core library that is written in Rust and is used to quickly sync data between the two.

The server exposes a REST API that can be used to interact with data. Oxen.ai’s clients include a command line interface, as well as bindings for Rust 🦀, Python 🐍, and HTTP interfaces 🌎 to make it easy to integrate into your workflow.

Installation

Oxen makes versioning your datasets as easy as versioning your code. You can install through homebrew or pip or from our releases page.

brew tap Oxen-AI/oxen
brew install oxen

Remote vs Local Workflow

In the world of version control, there are two main paradigms: centralized and decentralized. Centralized version control systems allow you to have remote first workflows where you do not need to have a fully copy of the data on your local machine. Decentralized version control systems like git by default duplicate all the data to every node in your network.

While the decentralized nature of git makes it easy to maintain full copies of the history across many machines, this is not practical for large datasets. Oxen was designed from the ground up to be able to seamlessly switch between local (decentralized) and remote (centralized) workflows. Only clone what you need, and contribute back to the remote repository when you are done.

Remote Workflow

To get started with the remote workflow, you need to setup an oxen-server. Oxen.ai provides both an open source server and a hosted solution that can be used to sync data between your local machine and the cloud. To try the hosted solution, you can create a free account at https://oxen.ai.

To learn how to setup the open source server, check out the server documentation.

Remote Repository

If a remote repository already exists, you simply have to pass in the namespace/name of the remote repository you want to connect to.

from oxen import RemoteRepo

repo = RemoteRepo("ox/CatDogBBox")

This is a cheap operation that just sets up the pointer to the remote repository. It does not download any data.

Create a Remote Repository

If you do not already have a remote repository, you can create one directly from Pyhton. You may want to start with an empty remote repository and add your data later.

from oxen import RemoteRepo

repo = RemoteRepo.create("my-user/my-repo-name")

By default we add a README.md file to the repository with an initial commit. If you want to create an empty repository without adding a README.md you can pass empty=True to the create method.

from oxen import RemoteRepo

repo = RemoteRepo.create("my-user/my-repo-name", empty=True)

The reason you may want to start with an empty repository is if you already started a local repository and want to push it to the remote repository. This local repository already has a commit history. When pushing to a remote, commit histories must match. Hence we need to start with an empty remote repository without any commits if we want to push a local repository with a commit history.

Add Files

You can add files to the remote repository by passing the path to the file and the destination directory. This will upload the file to the remote repository and stage it for commit.

Python
from oxen import RemoteRepo
repo = RemoteRepo("ox/CatDogBBox")
repo.add("images/000000002754.jpg", dst="images/")

Commit Changes

You can commit changes to the remote repository by passing a message.

Python
repo.commit("Adding the 1,000,001st image to the dataset")

File Exploration

To see the files in the remote repository you can use ls.

from oxen import RemoteRepo

repo = RemoteRepo("ox/CatDogBBox")
print(repo.ls())

To view a specific directory you can pass the directory name to the ls method.

Note: the directories are paginated so you will need to use the page_num parameter to view the next page of results. There are also total_pages, page_number, and total_entries attributes that give you information about the pagination.

from oxen import RemoteRepo

repo = RemoteRepo("ox/CatDogBBox")
images_results = repo.ls("images", page_num=1, page_size=10)
print(images_results)
print(images_results.total_pages)
print(images_results.page_number)
print(images_results.total_entries)

Downloading Data

You can download individual files and folders if you do not need the entire data repository for your job.

oxen download ox/CatDogBBox annotations/test.csv

Checkout a Branch

If you have a data on a separate branch that you want to view you can checkout a branch by passing the branch name to the checkout method.

Python
from oxen import RemoteRepo
repo = RemoteRepo("ox/CatDogBBox")
repo.checkout("my-branch-name")
print(repo.ls())

Create a New Branch

The checkout method also allows you to create a new branch if the branch does not exist.

Python
from oxen import RemoteRepo
repo = RemoteRepo("ox/CatDogBBox")
repo.checkout("my-new-branch-name", create=True)
print(repo.ls())

View Branches

To see all the branches in the remote repository you can use the branches method.

Python
from oxen import RemoteRepo
repo = RemoteRepo("ox/CatDogBBox")
print(repo.branches())

Workspaces

Under the hood, the way that we enable remote collaboration is through a concept called a workspace. A workspace can be thought of as a working copy of changes, that is stored on the remote server. Just like you can add files before committing locally, you can add files to a workspace on the remote server before committing. This allows you to build up a set of changes remotely before committing them in bulk.

from oxen import RemoteRepo
from oxen import Workspace

repo = RemoteRepo("ox/CatDogBBox")
workspace = Workspace(repo, "add-images")
workspace.add("/path/to/image.png")
status = workspace.status()
print(status.added_files())
workspace.commit("Adding the 1,000,001st image to the dataset")

The RemoteRepo.add method is a shortcut for creating a workspace and adding files to it. It creates a ephemeral workspace and adds the files to it, and deletes the workspace after committing.

To learn more about workspaces, check out the workspaces documentation.

Connect Local to Remote

Remote repositories are identified by a remote URL. This is the URL that you can use to clone the repository.

Python
from oxen import RemoteRepo

repo = RemoteRepo.create("my-user/my-repo-name", empty=True)
print(repo.url())

You can use this URL to clone the repository.

Python
# Local Repository
from oxen import Repo
from oxen import RemoteRepo

remote_repo = RemoteRepo.create("my-user/my-repo-name", empty=True)
repo_url = remote_repo.url()

local_repo = Repo("/path/to/local/repo")
local_repo.clone(repo_url)

Or you can set the remote of a local repository to the remote repository.

Python
from oxen import Repo
from oxen import RemoteRepo

remote_repo = RemoteRepo.create("my-user/my-repo-name", empty=True)
repo_url = remote_repo.url()

local_repo = Repo("/path/to/local/repo")
remote_repo.set_remote("origin", remote_repo.url())

Local Workflow

Local workflow looks a lot like git. The downside is that you have to duplicate all the data locally. The upside is that oxen is optimized to make local workflows fast.

Clone Dataset

Clone your first Oxen repository from the OxenHub.

oxen clone https://hub.oxen.ai/ox/CatDogBBox

Initialize User

Each change you make will be associated with a name and email. Set them before you get started so you know who changed what. The user data is saved by default in ~/.config/oxen/user_config.toml.

oxen config --name "Bessie Oxington" --email "bessie@yourcomany.com"

Create Repository

Initialize your first Oxen repository, and commit the first version of your data.

# Initialize the repository
oxen init
# Write data to a file
printf '%s\n' 'name,age' 'bob,12' 'jane,13' > people.csv
# Stage the data for commit
oxen add people.csv
# Commit the changes with a message
oxen commit -m "Adding my data"

Version Your Data

Once your data has been committed, you can always return to that version.

Confidently overwrite the file, move the file, delete the file, it doesn’t matter. Oxen will always have a copy of the data at the time of the previous commit.

Create Branch

It is good practice to create a new branch for changes you make to your data. This will allow you to easily compare the parallel versions of your data over time.

# Checkout a branch named `modify-data`
oxen checkout -b modify-data
# Overwrite data in existing file
printf '%s\n' 'name,age' 'bob,12' 'jane,13' 'joe,14' > people.csv

Delete Branch

Once finished with a branch, you can delete it.

# Checkout main branch locally
oxen checkout main
# Delete 'other_branch' locally
oxen branch -d new_branch # may need -D if branch is not merged into main
# Delete branch in remote repo
oxen push origin --delete new_branch

Diff Changes

View the change you made with the oxen diff command. This will show you the changes you made to your data since the last commit.

CLI
oxen diff image_classification_data.csv
Column changes:
   + label (str)

Row changes:
   Δ 1 (modified)
   + 3 (added)
   - 2 (removed)

shape: (6, 7)
+-------------+-----+-----+-------+--------+-------------+-------------------+
| file        | x   | y   | width | height | label.right | .oxen.diff.status |
| ---         | --- | --- | ---   | ---    | ---         | ---               |
| str         | i64 | i64 | i64   | i64    | str         | str               |
+-------------+-----+-----+-------+--------+-------------+-------------------+
| image_0.jpg | 0   | 0   | 10    | 10     | cat         | modified          |
| image_1.jpg | 1   | 2   | 10    | 20     | null        | removed           |
| image_1.jpg | 200 | 100 | 10    | 20     | dog         | added             |
| image_2.jpg | 4   | 10  | 20    | 20     | null        | removed           |
| image_3.jpg | 4   | 10  | 20    | 20     | dog         | added             |
| image_4.jpg | 10  | 10  | 10    | 10     | dog         | added             |
+-------------+-----+-----+-------+--------+-------------+-------------------+

Once you push you changes to OxenHub, you can view the changes you made in your commit history.

The diff command line tool is more powerful than it looks on the surface. Oxen has the ability to diff files of many formats, and the ability to specify keys are targets in tabular diffs to make it easier to see what changed.

For advanced usage, check out the full diff documentation.

Restore Changes

If you are not happy with the changes you made to your data, you can restore them to the previous commit with the oxen restore command.

oxen restore people.csv

Commit Changes

Once you are happy with the changes you have made to your data, you can commit them to the repository with a new message.

oxen add people.csv
oxen commit -m "Adding Joe to the dataset"

View History

To see the commit history of your repository, you can use the oxen log command.

oxen log

Checkout Main Branch

Once you are done making changes to your data, you can return to the main branch with the oxen checkout command.

Never fear, the file now has now been reverted to the inital commit again, but your changes will be saved in the branch you created.

oxen checkout main

List Branches

To see the branches in your repository, you can use the oxen branch command.

oxen branch

Push Data

Once your data has been committed locally, you can sync it to the OxenHub.

OxenHub is a free service that allows you to collaborate on your data in the cloud. You can create a free account at https://oxen.ai.

# Go create repo at https://oxen.ai
# ...
oxen config --set-remote origin https://hub.oxen.ai/<namespace>/<repo_name>
oxen config --auth hub.oxen.ai <your_auth_token>
oxen push origin main
# to push your other branch simply change the branch name from `main` to `modify-data`

Clone Data

Clone your data faster than ever before. Oxen has been optimized to the core to make pulling large datasets as fast as possible.

oxen clone https://hub.oxen.ai/ox/CatDogBBox

Pull Changes

Only pull the changes you need. Oxen will only pull the files that have changed since the last time you pulled.

oxen pull origin main