| Tool | Time | Can view data? |
|---|---|---|
| 🐂 Oxen.ai | 1 hour and 30 mins | ✅ Yes |
| Tarball + S3 | 2 hours 21 mins | ❌ No |
| aws s3 cp | 2 hours 48 mins | ❌ No |
| DVC + Local | 3 hours | ❌ No |
| DVC + S3 | 4 hours and 51 mins | ✅ Yes, w/ other tools |
| Git-LFS | 20 hours | ❌ No |
All tests were run on a t3.2xlarge EC2 instance with 4 vCPUs, 16.0 GB of RAM, and a 1TB EBS volume attached. We found that the size of the EBS volume did impact the IOPS for adding and committing data for all tools. All network transfer happened within the us-west-1 AWS region to S3.
## Git-LFS: 20+ hours
Adding and committing data locally is not terribly slow (though still slower than Oxen), but Git-LFS does have to hash and copy every file into the hidden `.git` directory. The combination of a slow hashing algorithm and copying large files makes git-lfs slower than it has to be on `add` and `commit`.
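To make the cost concrete, here is a minimal sketch (not git-lfs itself) of the per-file work described above: hash every file's full contents, then copy it into a hidden storage directory. The `.fake_lfs` directory and the sample files are made up for illustration.

```shell
set -e
mkdir -p demo/data demo/.fake_lfs/objects
printf 'image bytes %s\n' 1 2 3 > demo/data/a.img
printf 'more image bytes\n'     > demo/data/b.img

for f in demo/data/*.img; do
  hash=$(sha256sum "$f" | cut -d' ' -f1)   # hash the full file contents
  cp "$f" "demo/.fake_lfs/objects/$hash"   # then copy it into hidden storage
done

ls demo/.fake_lfs/objects | wc -l   # 2 stored copies, one per file
```

For a dataset like ImageNet, doing this hash-and-copy for every one of millions of files is where the local time goes.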
The real killer here, though, is the push. Pushing data to the remote takes over 20 hours in the case of ImageNet, even on the same network as our other tests.
## DVC + S3: 4 hours and 51 mins
As you can see, DVC is not as slow as Git-LFS, but it requires significantly more commands to remember and execute.
## DVC + Local: 3 hours
As we'll see below, Oxen is faster than DVC even if you drop the overhead of network transfer.
## Tarball + S3: 2 hours 21 mins
This may work well for cold storage of data you rarely need to view again, but for anything else, Oxen is a much better tool.
Oxen smartly compresses and creates smaller data chunks behind the scenes while transferring your data across the network, taking advantage of the network bandwidth and reducing the amount of time it takes to upload and download data.
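To see why chunking helps, here is a minimal sketch (not Oxen's actual implementation) of fixed-size chunking with content hashing. A file full of redundant bytes splits into many chunks, but hashing each chunk shows only the unique ones would need to be transferred; the file contents and 4 KB chunk size are illustrative.

```shell
set -e
mkdir -p chunk_demo
# A 400 KB file of repeated bytes stands in for redundant dataset content.
head -c 409600 /dev/zero > chunk_demo/data.bin
# Split into fixed-size 4 KB chunks, then hash each chunk.
split -b 4096 chunk_demo/data.bin chunk_demo/chunk_
total=$(ls chunk_demo/chunk_* | wc -l)
unique=$(sha256sum chunk_demo/chunk_* | cut -d' ' -f1 | sort -u | wc -l)
echo "total chunks:  $total"    # 100
echo "unique chunks: $unique"   # 1 -> only one chunk actually needs to move
```

Real data is nowhere near this redundant, but the same idea (deduplicating by chunk hash, then compressing what remains) is why chunked transfer beats shipping whole files one by one.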
What about just using the `aws s3 cp` command with the `--recursive` flag?
## aws s3 cp: 2 hours 48 mins
This is a bit slower overall than the tarball method, and you still have the same problems of iterating on and viewing the data. Looking at the logs, the S3 SDK appears to sync the files one by one, which accounts for the slowness.
## Oxen: 1 hour and 30 mins
If you are curious how Oxen works under the hood, we are working on a detailed technical writeup that dives into the Merkle tree, block-level deduplication, and more here.