🔥 Performance
🖼️ 1 Million Files Benchmark
When we first started working on Oxen.ai, we set out to build a tool that would make it easy to collaborate on the large datasets that power modern AI research.
One dataset that comes to mind is the original ImageNet dataset. It spans 1,000 object classes and contains more than 1,000,000 training images and 100,000 test images. It is commonly shared as a tarball or zip file, or dumped to S3, without much visibility into the data itself.
A version control system (VCS) would be a much better way to share and iterate on datasets like ImageNet. This is an example of a dataset that hasn't been updated since its initial release. Backing the dataset with a VCS would allow people to collaborate on it without duplicating data all over the place.
In order to do this effectively, the VCS needs to be fast enough to make the developer experience worth it. Not an easy task, but one we were willing to plow through at Oxen.ai 🐂
📊 The Raw Numbers
To create this benchmark, we took the 1 million+ images from ImageNet and added them to Oxen, DVC, Git-LFS, and S3. The total time measured is how long it takes to get the files from A (the local filesystem) to B (remote storage) successfully. The steps to reproduce and the machine specs are in the sections below.
Here are the results in ranked order from fastest to slowest.
| Tool | Time | Can view data? |
|---|---|---|
| 🐂 Oxen.ai | 1 hour and 30 mins | ✅ Yes |
| Tarball + S3 | 2 hours 21 mins | ❌ No |
| aws s3 cp | 2 hours 48 mins | ❌ No |
| DVC + Local | 3 hours | ❌ No |
| DVC + S3 | 4 hours and 51 mins | ✅ Yes, w/ other tools |
| Git-LFS | 20 hours | ❌ No |
Notice that Oxen is faster than even the laziest of methods, creating a tarball and uploading it to S3, while adding the ability to view, query, and compare versions of the data. If you would like us to add any other tools to the benchmark, please let us know!
⚙️ Hardware and Network
All of the benchmarks were executed on a `t3.2xlarge` EC2 instance with 8 vCPUs and 32.0 GB of RAM and a 1TB EBS volume attached. We found that the size of the EBS volume did impact the IOPS for adding and committing data for all tools. All network transfer to S3 stayed within the us-west-1 AWS region.
👀 View the Data
One of the other advantages of using Oxen.ai, besides raw speed, is that you can view, query, and collaborate on the data as soon as you've pushed it to the web hub. Feel free to explore the end result here in Oxen.ai.
🧐 Why not Git?
Everybody knows and loves Git. But we also know that it isn't exactly suited to versioning data. Trying to add multi-gigabyte datasets can quickly blow up storage costs and cause serious slowdowns. And that isn't really Git's purpose, either: GitHub, for instance, doesn't even accept files larger than 100 megabytes.
Over the years, however, several attempts have been made to extend Git to gigabyte or even terabyte scale. In 2015, Git-LFS support was added to GitHub; it speeds up pulls by downloading files lazily, replacing tracked files with pointers and retrieving their content upon checkout. Data Version Control (DVC) came out in 2017, employing a similar concept but storing the file contents outside of Git.
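For a sense of what "replacing tracked files with pointers" means: a file tracked by Git-LFS is stored in the Git tree as a small text pointer while the actual bytes live on the LFS server. It looks roughly like this (the digest and size here are placeholders, not real values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character-hex-digest>
size 1048576
```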
In theory, it sounds great to tie your data versioning to Git, the most popular version control system in the world. But in practice, it is a bit like trying to fill a swimming pool with a straw: you can do it, but you are tied to the limitations of the Git protocols.
🐂 How does Oxen.ai work?
With Oxen.ai, we take a different approach. Rather than trying to extend Git, we built Oxen, taking inspiration from Git where we can. We didn't want to make you learn a completely new tool: if you know how to use git, you know how to use Oxen. But we also designed Oxen specifically to make versioning large amounts of data as fast as possible. Under the hood, Oxen uses Merkle trees, smart network protocols, and fast hashing algorithms to reduce the amount of data our repositories store.
Unbound by Git, however, we're also able to employ several other optimizations that make Oxen fast, such as block-level deduplication, compression, iterating on subtrees, and more. Some of these optimizations are still under development, but we're excited to share what we have so far, and you can find a deeper dive and a list of upcoming features here.
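To make block-level deduplication concrete, here is a toy sketch in shell. This is our own illustration, not Oxen's actual implementation: files are split into fixed-size blocks, each unique block is stored once under its hash, and a file becomes a manifest of block hashes. Two versions of a dataset that share most of their bytes then share most of their blocks on disk.

```bash
#!/usr/bin/env bash
# Toy content-addressed block store (illustrative only, not Oxen's actual code).
set -euo pipefail

STORE=".store"        # where unique blocks live, keyed by their hash
mkdir -p "$STORE"

store_file() {
  local file="$1"
  local tmp
  tmp=$(mktemp -d)
  split -b 1M "$file" "$tmp/block-"            # fixed-size 1 MB blocks
  for block in "$tmp"/block-*; do
    local hash
    hash=$(sha256sum "$block" | cut -d' ' -f1)
    # a block shared between two versions of the dataset is stored only once
    [ -f "$STORE/$hash" ] || cp "$block" "$STORE/$hash"
    echo "$hash"                               # ordered hashes form the file's manifest
  done
  rm -rf "$tmp"
}

# usage: record the block manifest for one file (path is a placeholder)
store_file "train/some_image.JPEG" > some_image.manifest
```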
All of the code is open source and available on GitHub. We appreciate any feedback you have and welcome any stars and contributions!
🏃 Running the Experiments
To give you a sense of the process, as well as the advantages and challenges of each method, we ran the experiments below, listed from slowest to fastest.
Git-LFS (~20 hours)
Git-LFS is a popular first tool to try since it is already in the Git ecosystem. The problem is that it is painfully slow when it comes to adding, committing, and pushing non-text files. It can also be annoying to remember which files are tracked under LFS versus regular Git. More than once I have accidentally committed a multi-GB file to plain Git and wondered why my push was taking so long. Removing files from the Git Merkle tree is a whole other pain.
Steps to reproduce:
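A sketch of the commands, assuming the images sit in `train/` and `val/` directories with ImageNet's `.JPEG` extension, and that an LFS-enabled remote already exists:

```bash
# Install the LFS hooks and mark the image files as LFS-tracked
git init
git lfs install
git lfs track "*.JPEG"
git add .gitattributes

# Every file gets hashed and copied into the hidden .git directory
git add train/ val/
git commit -m "add ImageNet train and val splits"

# Push the pointers to Git and upload the file contents to the LFS server
git remote add origin <your-lfs-enabled-remote>
git push origin main
```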
Total Time: 20+ hours
Adding and committing data locally is not terribly slow (though still slower than Oxen), but it does have to hash and copy every file into the hidden `.git` directory. The combination of a slow hashing algorithm and copying large files makes git-lfs slower than it has to be on `add` and `commit`.
The real killer here, though, is the push 🥱. Pushing the data to the remote takes over 20 hours in the case of ImageNet, even on the same network as our other tests.
DVC + S3 Backend (~5 hours)
DVC is a popular tool, tightly integrated with the Git ecosystem, and can be configured with multiple storage backends. You'll see that you have to toggle back and forth between DVC and Git, with 11 commands to remember and execute. It is easy to make a mistake and track the wrong things in your Git repo, and it takes a moment to wrap your head around the fact that you are using two different tools to version your data.
Steps to reproduce:
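A sketch of the flow (the bucket name and dataset paths are placeholders; your exact commands may differ). Note how you alternate between `dvc` and `git`:

```bash
git init
dvc init
git commit -m "initialize DVC"

# point DVC at an S3 bucket for file contents
dvc remote add -d storage s3://my-bucket/imagenet
git add .dvc/config
git commit -m "configure S3 remote"

# DVC hashes the data and writes small .dvc pointer files for Git to track
dvc add train val
git add train.dvc val.dvc .gitignore
git commit -m "add ImageNet train and val splits"

# upload the file contents to S3, then push the pointers to the Git remote
dvc push
git push origin main
```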
Total Time: 4 hours and 51 mins
As you can see, DVC is not as slow as Git-LFS, but it requires significantly more commands to remember and execute.
DVC + Local Storage Backend (~3 hours)
We wanted to run another test with DVC without any network transfer, purely to measure the protocol overhead. Transferring to S3 may not be the best apples-to-apples comparison, since Oxen also compresses and deduplicates data during network transfer.
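The setup is the same as the S3 benchmark above, except the DVC remote points at another directory on the local filesystem (the path here is a placeholder):

```bash
# same flow as before, but the remote is a local directory instead of S3
dvc remote add -d localstorage /mnt/dvc-storage
dvc add train val
git add train.dvc val.dvc .gitignore
git commit -m "add ImageNet train and val splits"
dvc push
```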
Total Time: 3 hours
As we'll see below, Oxen is faster than DVC even if you drop the overhead of network transfer.
Tarball + S3 (~2 hours 21 mins)
I like to call this one the "'F' it, let's just create a tarball and upload it to S3" approach. Easy to remember, easy to use, but neither efficient nor effective when it comes to iterating on data.
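For reference, this sketch is the whole flow (the bucket name is a placeholder):

```bash
# compress the dataset into a single archive, then upload it in one shot
tar -czf imagenet.tar.gz train/ val/
aws s3 cp imagenet.tar.gz s3://my-bucket/imagenet.tar.gz
```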
Total Time: 2 hours 21 mins
This may work well for cold storage of data you rarely want to view again. But for anything else, Oxen is a much better tool.
Oxen smartly compresses and creates smaller data chunks behind the scenes while transferring your data across the network, taking advantage of the network bandwidth and reducing the amount of time it takes to upload and download data.
aws s3 cp (~2 hours 48 mins)
You may be asking yourself: if the tarball takes so long to create, why not just use the `aws s3 cp` command with the `--recursive` flag?
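Something like this (again, the bucket name is a placeholder):

```bash
# upload the directories file by file, with no archive step
aws s3 cp train/ s3://my-bucket/train/ --recursive
aws s3 cp val/ s3://my-bucket/val/ --recursive
```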
Total Time: 2 hours 48 mins
This is a bit slower overall than the tarball method, and you still have the same problems when iterating on and viewing the data. Looking at the logs, it appears the S3 SDK is uploading the files one by one, which accounts for the slowness.
Oxen.ai (~1 hour and 30 mins)
With Oxen, if you know how to use git, there are no extra commands to remember. With the same commands as plain old git, you can initialize, add, commit, and push your data to the remote.
Steps to reproduce:
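The flow mirrors Git (a sketch; the namespace and repo name on hub.oxen.ai are placeholders):

```bash
oxen init
oxen add train/ val/
oxen commit -m "add ImageNet train and val splits"

# create the repo on the hub and wire it up as the remote
oxen create-remote --name <namespace>/ImageNet
oxen config --set-remote origin https://hub.oxen.ai/<namespace>/ImageNet
oxen push origin main
```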
Total Time 🔥: 1 hour and 30 mins
If you are curious how Oxen works under the hood, we are working on a detailed technical writeup that dives into the Merkle tree, block-level deduplication, and more here.
Try Oxen.ai for Yourself
If you would like to try Oxen.ai for yourself, you can sign up for a free account here. All of the code is open source and available on GitHub. Let us know what you think by joining our Discord.