Oxen makes it feel like your dataset is local, even when it is not downloaded to the machine you are on.

The default oxen clone will clone the latest commit data on the default branch, but this is not desirable for all use cases. It can be expensive depending on the size of your data repository, and you may just be doing exploratory data analysis or only need a subset of the data.

Interacting with the Remote

There is a --shallow flag on oxen clone which only pulls the necessary metadata, but does not pull any actual files.

oxen clone https://hub.oxen.ai/ox/CatDogBBox --shallow
cd CatDogBBox

If you do a quick ls on the command line you will see that there are no files locally. Never fear, we are in a shallow state and can still interact with the repo remotely.

Data Exploration

To view the remote files you can use oxen remote ls which lists the files, as well as giving a summary of their data types.

oxen remote ls

To view a specific directory you can use oxen remote ls <directory>. Note: the directories are paginated so you will need to use the --page flag to view the next page of results. The python library also gives access to pagination parameters.

oxen remote ls images/

Downloading Data

You can even download subsets of the data in the event you do not need the entire data repository for your job. You may even just want to do some intial exploration of the data before you decide to clone the entire repository.

oxen remote download images/000000002754.jpg

Exploring Data Frames

Data Frames can be explored straight from the command line or downloaded into your local python environment to do additional exploration. The same parameters for exploring local data frames apply to remote data frames, and can be found in the data frame documentation.

# Note that the df is not downloaded locally
oxen remote df annotations/train.csv

You can think of the remaining oxen remote subcommands similar to the standard oxen but in your remote workspace. For example, you can run oxen remote status to see the status of your remote workspace.

oxen remote status

Adding Files

When adding data, it is always a good idea to create a branch for the changes you are about to make. This will allow you to make changes without affecting the default branch.

Creating a Branch

oxen checkout -b add-images
oxen push origin add-images

Staging Files

To add raw files to the remote workspace you can use oxen remote add <file>. This will add the file to the remote workspace, but will not commit the changes.

Each user has their own independent remote workspace. This means that you can add files to your remote workspace without affecting other users on the branch, and that when you commit, it will only commit the subset of data within your remote staging area.

oxen remote add image.png
oxen remote status

Removing Remote Staged Files

If you accidentally add file from the remote workspace and want to remove it, no worries, you can unstage it with oxen remote rm --staged.

$ oxen remote rm --staged my-images/image.jpg

Commit Staged Files

When you are confident in the changes you have made, you can commit the changes to the remote workspace. This will create a new commit on the remote branch.

oxen remote commit -m "adding an image"

🎉 You have now committed data to the remote branch without cloning the full repo.

Note: If the remote branch cannot be merged cleanly, the remote commit will fail, and you will have to resolve the merge conflicts with some more advanced commands which we will cover later.

Remote Log

To see a list of remote commits on the branch you can use remote log. Your latest commit will be at the top of this list.

oxen remote log

Appending To Data Frames

Commonly, you will want to tie some sort of annotation to your files. For example, you might want to label an image with a bounding box, or a video with a bounding box and a class label.

Oxen has native support for extending and managing structured Data Frames in the form of csv, jsonl, or parquet files. To interact with these files remotely you can use the oxen remote df command.

We will be focusing on adding data to these files, but you can also use the oxen remote df command to view the contents of a DataFrame with all the same parameters locally. To learn more about the oxen remote df command, check out the data frame documentation.

TODO: add python documentation

oxen remote df annotations/train.csv
Full shape: (9000, 6)

Slice shape: (10, 6)
+-------------------------+-------+--------+--------+--------+--------┐
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ cat0.019.27130.79129.58 |
| images/000000544590.jpg ┆ cat9.7513.49214.25188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.3767.7974.29116.08 |
| images/000000236841.jpg ┆ cat115.2196.6593.8742.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000201969.jpg ┆ dog   ┆ 167.2473.9937.064.94  |
| images/000000201969.jpg ┆ dog   ┆ 110.8183.8718.0238.95  |
| images/000000201969.jpg ┆ dog   ┆ 157.04133.6338.6318.55  |
| images/000000201969.jpg ┆ dog   ┆ 97.72110.235.971.11  |
+-------------------------+-------+--------+--------+--------+--------+

Say you want to add a bounding box annotation to this dataframe without cloning it locally. You can use the --add-row flag on the oxen remote df command to remotely stage a row on the DataFrame.

oxen remote df annotations/train.csv --add-row "my-images/image.jpg,dog,100,100,200,200"
shape: (1, 7)
+----------------------------------+---------------------+-------+-------+-------+-------+--------┐
| _id                              ┆ file                ┆ label ┆ min_x ┆ min_y ┆ width ┆ height |
| ---                              ┆ ---                 ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    |
| str                              ┆ str                 ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    |
+----------------------------------+---------------------+-------+-------+-------+-------+--------+
| d2b9dc18c7e50eebd0541e98f8b83efa ┆ my-images/image.jpg ┆ dog   ┆ 100.0100.0200.0200.0  |
+----------------------------------+---------------------+-------+-------+-------+-------+--------+

This returns a unique ID for the row that we can use as a handle to interact with the specific row in the remote workspace. To list the added rows on the dataframe you can use the oxen remote diff command.

oxen remote diff annotations/train.csv
Added Rows

shape: (2, 7)
+----------------------------------+----------------------+-------+-------+-------+-------+--------┐
| _id                              ┆ file                 ┆ label ┆ min_x ┆ min_y ┆ width ┆ height |
| ---                              ┆ ---                  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    |
| str                              ┆ str                  ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    |
+----------------------------------+----------------------+-------+-------+-------+-------+--------+
| 822ac1facbd79444f1f33a2a0b2f909d ┆ my-images/image2.jpg ┆ dog   ┆ 100.0100.0200.0200.0  |
| ab8e28d66d21934f35efcb9af7ce866f ┆ my-images/image3.jpg ┆ dog   ┆ 100.0100.0200.0200.0  |
+----------------------------------+----------------------+-------+-------+-------+-------+--------+

If you want to delete a staged row, you can delete it with the --delete-row flag and the value in the _id column.

oxen remote df annotations/train.csv --delete-row 822ac1facbd79444f1f33a2a0b2f909d

To clear all staged rows, you can use the restore subcommand to restore the file.

oxen remote restore --staged annotations/train.csv

Once you have a set of rows you want to commit on your data frame, simply call oxen remote commit again.

oxen remote commit -m "adding an image"