🐂 What is Oxen?
Oxen is a lightning-fast data version control system for structured and unstructured machine learning datasets.
Oxen.AI’s interface mirrors git, but it shines in many areas where git and git-lfs fall short. Oxen is built from the ground up for data and is optimized to handle both large datasets and large files.
Oxen.AI comprises a command line interface, bindings for Rust 🦀 and Python 🐍, and HTTP interfaces 🌎, making it easy to integrate into your workflow.
✅ Features
Oxen is built around ergonomics and ease of use, and it is easy to learn. If you know how to use git, you know how to use Oxen.
Oxen Hub Features
- 🚀 Model Inference: No code model inference.
- 🏷️ Labeling Images: Edit any of your datasets straight from our UI.
- 📝 Text2SQL: Instant Text to SQL generation to ask your data questions.
- 🔍 Embeddings Search: Search your data by embedding similarity.
Oxen Open Source Features
- 🔥 Fast: Efficient indexing and syncing of any dataset size (millions of images? no problem)
- 🌎 Workspaces: Interact with your data without downloading it
- 🧠 Intuitive: Same commands as git
- 💪 Handles large, unstructured files: images, videos, audio, text, parquet, arrow, json, models, etc
- 📊 Native DataFrame processing: index, compare and serve up DataFrames
- 📈 Versioning: Never worry about losing the state of your data
- 🤝 Distributed Collaboration: sync to an oxen-server
🌾 What kind of data?
Oxen.ai is designed to efficiently manage large datasets, including those with large individual files, for example CSV files with millions of rows. It also handles datasets comprising millions of individual files and directories such as the complete collection of ImageNet images.
The backend is agnostic to data type, so feel free to add any binary blobs. We automatically detect certain data types on upload so that we can render them within the UI. Specifically, file types such as csv, tsv, jsonl, parquet, and arrow turn into beautiful data tables, while images, audio, and video files display or play natively.
🚀 Built for speed
One of the main reasons datasets are hard to maintain is the sheer performance cost of indexing the data and transferring it over the network. We wanted to be able to index hundreds of thousands of images, videos, audio files, and text files in seconds.
Watch below as we version hundreds of thousands of images in seconds 🔥
But speed is only the beginning. Think of Oxen.ai as a set of building blocks for your dream workflow.
⚒️ Installation
⬇️ Cloning Datasets
The fastest way to get up and running with Oxen is by cloning a dataset. Explore the many public datasets available today on OxenHub.
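As a rough sketch of what cloning could look like from a script, the snippet below builds an `oxen clone` command (mirroring `git clone`) and runs it only when the `oxen` binary is on the PATH. The repository URL is a placeholder, and the helper names `clone_command` and `run_if_installed` are made up for this illustration; check `oxen clone --help` for what your installed version supports.

```python
from __future__ import annotations

import shutil
import subprocess

# Placeholder URL (an assumption): substitute a real repository from OxenHub.
EXAMPLE_URL = "https://hub.oxen.ai/<namespace>/<repo>"


def clone_command(url: str) -> list[str]:
    """Build the argument list for cloning a dataset, mirroring `git clone`."""
    return ["oxen", "clone", url]


def run_if_installed(cmd: list[str]) -> bool:
    """Run cmd only if its executable is on the PATH; report whether it ran."""
    if shutil.which(cmd[0]) is None:
        print("not installed, would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True
```

Calling `run_if_installed(clone_command(EXAMPLE_URL))` with a real repository URL would perform the clone, or print the command if Oxen is not installed yet.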
⬆️ Pushing Datasets
Create your own repository and push your datasets to OxenHub to share them with your team or the world.
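Because the interface mirrors git, publishing a dataset is likely to follow the familiar init, add, commit, push sequence. The sketch below only assembles and prints those commands as a dry run; it does not execute anything. The function name `push_workflow`, the `images/` path, the commit message, and the remote URL are illustrative assumptions, and the exact flags (in particular `config --set-remote`) should be checked against `oxen --help` for your version.

```python
from __future__ import annotations

import shlex


def push_workflow(data_path: str, message: str, remote_url: str,
                  branch: str = "main") -> list[list[str]]:
    """Return the CLI steps, in order, for publishing a local dataset."""
    return [
        ["oxen", "init"],                    # start tracking the current directory
        ["oxen", "add", data_path],          # stage the data
        ["oxen", "commit", "-m", message],   # record a version
        ["oxen", "config", "--set-remote", "origin", remote_url],  # set the remote
        ["oxen", "push", "origin", branch],  # upload the commit to the remote
    ]


# Dry run: print each command, shell-quoted, instead of executing it.
for step in push_workflow("images/", "Add training images",
                          "https://hub.oxen.ai/<namespace>/<repo>"):
    print(shlex.join(step))
```

Keeping the steps as argument lists rather than one shell string makes it straightforward to pass each one to `subprocess.run` later without worrying about quoting.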
📚 Learn The Basics
There are many ways to use Oxen. You can use the command line interface, the Python library, or the OxenHub web interface. Learn the basics of each below.
Command Line Interface
Learn how to use the Oxen command line interface
Python Library
Get started with the Python library
Web Interface
Use the OxenHub web interface
Self Host
Host Oxen in your own infrastructure
🕵️ Explore Use Cases
See example repositories for inspiration.
Computer Vision
Classify images, detect objects, perform semantic segmentation, and more.
Natural Language Processing
Build chatbots, analyze sentiment, answer questions and more.
Audio
Classify audio, detect speakers, transcribe speech and more.
Generative AI
Generate images, text, music and more.
🌾 Why Build Oxen?
Oxen was built by a team of machine learning engineers who have spent countless hours in their careers managing datasets. We have used many different tools, but none of them were as easy to use or as ergonomic as we would like.
If you have ever tried git-lfs to version large datasets and become frustrated, we feel your pain. Solutions like git-lfs are too slow at the scale of data we need for machine learning.
If you have ever uploaded a large dataset of images, audio, video, or text to a cloud storage bucket with a name like:
s3://data/images_july_2022_final_2_no_really_final.tar.gz
then you know why we built Oxen to be the tool we wish we had.
🤖 Built for AI
If you are building an AI application, data is the lifeblood. Data is constantly changing over time, and data differentiates your model from the competition.
Whether you are building your own model from scratch, fine-tuning a pre-trained model, or using a model as a service, you will need to manage and compare the inputs and outputs over time to ensure your model is improving.
We version our code, why not our data?
Versioning your data means you can experiment on models in parallel with different data. The more experiments you run, the smarter your model becomes, and more robust models lead to better products.
🐂 Why the name Oxen?
“Oxen” comes from the fact that we will plow, maintain, and version your data like a good farmer tends to their fields 🌾. During the agricultural revolution, the plow and the ability to offload work to oxen helped people specialize and start working on other important societal tasks. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level ML problems that matter to your product.