Core Principle

Oxen was designed from the ground up to be fast. Whether you have many small files, a few large files, or a mix of both, Oxen intelligently hashes, packages, and syncs the data as fast as an Ox physically can.

Food 101 Dataset

The Food 101 dataset has 100k images spread across many subdirectories. Here is the Food 101 Dataset on Oxen.ai.

~ TLDR ~

  • ✅ Oxen synced all the images in about 3 minutes
  • 🦥 DVC backed by S3 took 16 minutes
  • 🦥 git + git lfs syncing to GitHub took over an hour

🐂 Oxen

oxen add images # 12.90 secs
oxen commit -m "adding images" # 34.77 secs
oxen push origin main # 150.22 secs

Total time: ~3 min to sync to Oxen.
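Per-step timings like the ones above can be collected with a small harness. The following is a minimal sketch, not the script used for these measurements: `time_steps` and `report` are hypothetical helper names, and it assumes each command can be run without a shell (no pipes or redirects).

```python
import shlex
import subprocess
import time


def time_steps(commands):
    """Run each command in order and return (command, seconds) pairs."""
    timings = []
    for command in commands:
        start = time.perf_counter()
        # check=True aborts on the first failing step, so a broken run
        # does not produce misleadingly short timings for later steps.
        subprocess.run(shlex.split(command), check=True)
        timings.append((command, time.perf_counter() - start))
    return timings


def report(timings):
    """Format per-step timings plus a total, one line per step."""
    lines = [f"{command}  # {seconds:.2f} secs" for command, seconds in timings]
    total = sum(seconds for _, seconds in timings)
    lines.append(f"Total: {total:.2f} secs (~{total / 60:.1f} min)")
    return "\n".join(lines)
```

For example, `print(report(time_steps(["oxen add images", 'oxen commit -m "adding images"', "oxen push origin main"])))` would time the three Oxen steps shown above. Wall-clock numbers will vary with network and hardware.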

Git + Git LFS

Compare this to a system like git lfs on the same dataset.

Git LFS also has many more commands to keep track of in your head, and they are easy to mess up.

git init
git lfs install
git lfs track "*.jpg"
git add .gitattributes
git add images # 132.82 secs
git commit -m "adding images"
git push origin main # 79.96 min

Total time pushing to Hugging Face: 82+ min

DVC + S3 Backend

DVC is an open source project built on top of git, and it can be synced to S3 for storage.

You have to keep track of which commands are dvc and which are git, and the commands are not as intuitive as Oxen's. It is easy to track the wrong things in your git repo.

git init
dvc init
dvc add images/ # 249.16 secs
git add images.dvc .gitignore
git commit -m "adding images"
git remote add origin https://github.com/owner/repository.git
dvc remote add --default datastore s3://my-bucket
git push origin main
dvc push # 719.79 secs

Total: 968.95 secs ≈ 16 min
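To make the comparison between the three Food 101 runs explicit, the reported step timings can be summed and expressed relative to Oxen. This is a rough sketch using only the figures quoted above; steps with no reported timing (for example `git commit`) are left out, so the totals are lower bounds. The names `RUNS`, `totals_in_minutes`, and `slowdown_vs` are illustrative.

```python
# Per-step timings in seconds, copied from the runs above.
# Steps without a reported timing are omitted.
RUNS = {
    "oxen": [12.90, 34.77, 150.22],
    "git + git lfs": [132.82, 79.96 * 60],  # the push was reported in minutes
    "dvc + s3": [249.16, 719.79],
}


def totals_in_minutes(runs):
    """Sum each tool's step timings and convert seconds to minutes."""
    return {tool: sum(steps) / 60 for tool, steps in runs.items()}


def slowdown_vs(baseline, totals):
    """Express every total as a multiple of the baseline tool's total."""
    return {tool: minutes / totals[baseline] for tool, minutes in totals.items()}


totals = totals_in_minutes(RUNS)
for tool, ratio in slowdown_vs("oxen", totals).items():
    print(f"{tool}: {totals[tool]:.1f} min ({ratio:.1f}x the Oxen time)")
```

With these inputs the totals come to roughly 3.3, 82.2, and 16.1 minutes, which puts DVC + S3 at about 5x and git + git lfs at about 25x the Oxen time for this one run.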

aws s3 cp

NOTE: This test was run on the CelebA dataset with 200k images, so it is not an apples-to-apples comparison with the runs above. We did the same test in Oxen and it took ~6 minutes.

You may currently be storing your training data in AWS S3 buckets. Even this is slower than syncing to Oxen, not to mention that it lacks other features you gain with Oxen.

The AWS S3 tool syncs each image sequentially and takes about 38 minutes to complete. Oxen optimizes the file transfer, compresses the data, and delivers a 5-10x performance improvement depending on your network and compute.

time aws s3 cp images/ s3://testing-celeba --recursive
________________________________________________________
Executed in   38.87 mins
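As a quick sanity check on the 5-10x figure, the CelebA numbers quoted above can be turned into a speedup ratio. This is a back-of-the-envelope sketch: the Oxen time is the approximate "~6 minutes" stated above, not a precise measurement, and `speedup` is a hypothetical helper.

```python
AWS_S3_CP_MINUTES = 38.87  # reported above for `aws s3 cp --recursive`
OXEN_MINUTES = 6.0         # "~6 minutes" above; approximate


def speedup(baseline_minutes, candidate_minutes):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_minutes <= 0:
        raise ValueError("candidate_minutes must be positive")
    return baseline_minutes / candidate_minutes


ratio = speedup(AWS_S3_CP_MINUTES, OXEN_MINUTES)
print(f"Oxen is about {ratio:.1f}x faster than aws s3 cp on this run")
```

The ratio lands at roughly 6.5x, inside the 5-10x range, though a single run on one network is only an indication.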