Oxen.aiโ€™s shines at workflows and data sizes that git or git-lfs fall short. The interface is inspired by git, so that it is easy to learn, but has a few core differences. Oxen is built from the ground up to handle large datasets with many files or large csvs, parquet files, or other large binary blobs.

๐Ÿ’พ Oxen Versioning 101

The first thing you need to know about Oxen.ai is that allows to add files directly to the remote without pulling any data locally. If you want to make a contribution to a dataset with 1 Million Files, you do not want to wait to clone all the files locally just to add yours.

from oxen import RemoteRepo

# Connect your client
repo = RemoteRepo("my-username/my-repo")
# Upload the image
repo.add("images/image_1_000_001.png")
# Commit to the main branch
repo.commit("Adding the 1,000,001st image to the dataset")

This is just one example of how Oxen.ai enables a more developer friendly workflow for large datasets. There are also optimizations under the hood such as parallel file transfer, scalable merkle trees, and data deduplication to make Oxen go brrr (or mooo?).

Oxen.aiโ€™s comprised of a command line interface, as well as bindings for Rust ๐Ÿฆ€, Python ๐Ÿ, and HTTP interfaces ๐ŸŒŽ to make it easy to integrate into your workflow.

โœ… Features

Oxen.ai is built around ergonomics, ease of use, and it is easy to learn. If you know how to use git, you know how to use Oxen. The main difference is that data is a first class citizen in Oxen. The interface allows you to view, edit, query, and version your data as well as your code.

Web Hub Features

The Oxen Web Hub is an easy to use web interface that allows you to discover and explore datasets without having to download them. You can also upload files directly, add permissions, and explore branches directly in the interface.

  • ๐Ÿ”Ž Explore Datasets: View and query your data in a beautiful interface

  • ๐Ÿท๏ธ Labeling Workflows: Edit data from the UI, or build your own labeling workflows

  • ๐Ÿค Collaboration: Share datasets, models, and code with your team

  • ๐Ÿ“ Notebooks: Spin up a Python Notebook on a GPU in seconds

  • ๐Ÿš€ Model Inference: Run an LLM on your dataset simply by writing a prompt

Open Source Features

The tooling behind the Oxen.ai is open source and available on GitHub. This includes the powerful command line interface, oxen server, python library, and the HTTP API. The Oxen.ai web interface is built on top of the open source tooling, and can be deployed on your own infrastructure as a part of our enterprise offering.

  • ๐Ÿ”ฅ Fast: Efficient indexing and syncing of any dataset size (millions of images? no problem)

  • ๐ŸŒŽ Workspaces: Interact with your data without downloading it

  • ๐Ÿง  Intuitive: Same commands as git

  • ๐Ÿ’ช Handles large, unstructured files: images, videos, audio, text, parquet, arrow, json, models, etc

  • ๐Ÿ“Š Native DataFrame processing: index, compare and serve up DataFrames

  • ๐Ÿ“ˆ Versioning: Never worry about losing the state of your data

  • ๐Ÿค Distributed Collaboration: sync to an oxen-server

๐ŸŒพ What kind of data?

Oxen.ai is designed to efficiently manage large datasets, including those with large individual files, for example CSV files with millions of rows. It also handles datasets comprising millions of individual files and directories such as the complete collection of ImageNet images.

The backend is agnostic to data type, so feel free to add any binary blobs. We automatically detect certain data types on upload so that we can render them within the UI. Specifically filetypes such as csv, tsv, jsonl, parquet, arrow turn into beautiful data tables. Images, audio, and video files will also play natively.

๐Ÿš€ Built for speed

One of the main reasons datasets are hard to maintain is the pure performance of indexing the data and transferring the data over the network. We wanted to be able to index hundreds of thousands of images, videos, audio files, and text files in seconds.

Watch below as we version hundreds of thousands of images in seconds ๐Ÿ”ฅ

But speed is only the beginning. Think of Oxen.ai as a set of building blocks to build your dream workflow on top of.

โš’๏ธ Installation

brew tap Oxen-AI/oxen
brew install oxen

โฌ‡๏ธ Cloning Datasets

The fastest way to get up and running with oxen is by cloning a dataset. Explore the many public datasets we have today on the OxenHub.

oxen clone https://hub.oxen.ai/ox/CatDogBBox

โฌ†๏ธ Pushing Datasets

Create and share your own repository to share your datasets with your team or the world by pushing them to OxenHub.

# New repos can also be create in UI
oxen add .
oxen commit -m "message"
oxen create-remote --name <namespace>/<repo_name>
oxen config --set-remote origin https://hub.oxen.ai/<namespace>/<repo_name>
oxen push origin main

๐Ÿ“š Learn The Basics

There are many ways to use Oxen. You can use the command line interface, the python library, or the OxenHub web interface. Learn the basics of each below.

๐Ÿ•ต๏ธ Explore Use Cases

See examples repositories for inspiration.

๐ŸŒพ Why Build Oxen?

Oxen was build by a team of machine learning engineers, who have spent countless hours in their careers managing datasets. We have used many different tools, but none of them were as easy to use and as ergonomic as we would like.

If you have ever tried git lfs to version large datasets and became frustrated, we feel your pain. Solutions like git-lfs are too slow when it comes to the scale of data we need for machine learning.

If you have ever uploaded a large dataset of images, audio, video, or text to a cloud storage bucket with the name:

s3://data/images_july_2022_final_2_no_really_final.tar.gz

We built Oxen to be the tool we wish we had.

๐Ÿค– Built for AI

If you are building an AI application, data is the lifeblood. Data is constantly changing over time, and data differentiates your model from the competition.

Whether you are building your own model from scratch, fine-tuning a pre-trained model, or using a model as a service, you will need to manage and compare the inputs and outputs over time to ensure your model is improving.

We version our code, why not our data?

Versioning your data means you can experiment on models in parallel with different data. The more experiments you run, the smarter your model becomes, and more robust models lead to better products.

๐Ÿ‚ Why the name Oxen?

โ€œOxenโ€ comes from the fact that we will plow, maintain, and version your data like a good farmer tends to their fields ๐ŸŒพ. During the agricultural revolution, the plow and offloading work to Oxen helped people specialize and start working on other important societal tasks. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level ML problems that matter to your product.