Remote Workspace
Oxen makes it feel like your dataset is local, even when it is not downloaded to the machine you are on.
The default oxen clone downloads the data for the latest commit on the default branch, but this is not desirable for every use case. It can be expensive depending on the size of your data repository, and you may only be doing exploratory data analysis or need just a subset of the data.
Interacting with the Remote
There is a --shallow flag on oxen clone which pulls only the necessary metadata and does not pull any actual files.
oxen clone https://hub.oxen.ai/ox/CatDogBBox --shallow
cd CatDogBBox
If you run a quick ls on the command line, you will see that there are no files locally. Never fear: the repository is in a shallow state, and you can still interact with it remotely.
Data Exploration
To view the remote files you can use oxen remote ls, which lists the files and gives a summary of their data types.
oxen remote ls
To view a specific directory you can use oxen remote ls <directory>. Note: directory listings are paginated, so you will need to use the --page flag to view the next page of results. The Python library also gives access to pagination parameters.
oxen remote ls images/
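If you want to script the paginated listing, the following is a minimal Python sketch that only builds and prints the command lines; it does not call Oxen. The helper name remote_ls_command is hypothetical, and it assumes that --page takes a page number as its value, which this page does not spell out.

```python
import shlex


def remote_ls_command(directory=None, page=None):
    """Build the argv for `oxen remote ls`, optionally for one directory and page."""
    argv = ["oxen", "remote", "ls"]
    if directory is not None:
        argv.append(directory)
    if page is not None:
        # Assumption: --page takes a page number as its value.
        argv += ["--page", str(page)]
    return argv


# Print the commands for the first three pages of images/ without running them.
for page in range(1, 4):
    print(shlex.join(remote_ls_command("images/", page=page)))
```

Each printed line can be pasted into a shell, or the argv list can be passed to subprocess.run once you have confirmed the flag's behavior against your Oxen version.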
Downloading Data
You can also download subsets of the data when you do not need the entire data repository for your job, or when you want to do some initial exploration of the data before deciding whether to clone the entire repository.
oxen remote download images/000000002754.jpg
Exploring Data Frames
Data Frames can be explored straight from the command line or downloaded into your local Python environment for additional exploration. The same parameters for exploring local data frames apply to remote data frames, and can be found in the data frame documentation.
# Note that the df is not downloaded locally
oxen remote df annotations/train.csv
You can think of the remaining oxen remote subcommands as similar to the standard oxen commands, but operating on your remote workspace. For example, you can run oxen remote status to see the status of your remote workspace.
oxen remote status
Adding Files
When adding data, it is always a good idea to create a branch for the changes you are about to make. This will allow you to make changes without affecting the default branch.
Creating a Branch
oxen checkout -b add-images
oxen push origin add-images
Staging Files
To add raw files to the remote workspace you can use oxen remote add <file>. This will add the file to the remote workspace, but will not commit the changes.
Each user has their own independent remote workspace. This means that you can add files to your remote workspace without affecting other users on the branch, and that when you commit, it will only commit the subset of data within your remote staging area.
oxen remote add image.png
oxen remote status
Removing Remote Staged Files
If you accidentally add a file to the remote workspace and want to remove it, no worries: you can unstage it with oxen remote rm --staged.
oxen remote rm --staged my-images/image.jpg
Commit Staged Files
When you are confident in the changes you have made, you can commit the changes staged in your remote workspace. This will create a new commit on the remote branch.
oxen remote commit -m "adding an image"
🎉 You have now committed data to the remote branch without cloning the full repo.
Note: If the remote branch cannot be merged cleanly, the remote commit will fail, and you will have to resolve the merge conflicts with some more advanced commands which we will cover later.
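The branch, stage, and commit steps above can be collected into a small Python sketch. It uses only the commands shown in this section; the function names remote_add_workflow and run_steps are hypothetical, and the sketch defaults to a dry run that prints the commands rather than executing them.

```python
import shlex
import subprocess


def remote_add_workflow(branch, files, message):
    """Return the oxen commands from this section, in the order they are run."""
    steps = [
        ["oxen", "checkout", "-b", branch],
        ["oxen", "push", "origin", branch],
    ]
    # One `oxen remote add` per file, as in the example above.
    steps += [["oxen", "remote", "add", path] for path in files]
    steps.append(["oxen", "remote", "status"])
    steps.append(["oxen", "remote", "commit", "-m", message])
    return steps


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command,
            # for example a remote commit that cannot be merged cleanly.
            subprocess.run(argv, check=True)


run_steps(remote_add_workflow("add-images", ["image.png"], "adding an image"))
```

Calling run_steps(..., dry_run=False) from inside the shallow clone would execute the same sequence, assuming the oxen binary is on your PATH.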
Remote Log
To see a list of remote commits on the branch you can use oxen remote log. Your latest commit will be at the top of this list.
oxen remote log
Appending To Data Frames
Commonly, you will want to tie some sort of annotation to your files. For example, you might want to label an image with a bounding box, or a video with a bounding box and a class label.
Oxen has native support for extending and managing structured Data Frames in the form of csv, jsonl, or parquet files. To interact with these files remotely you can use the oxen remote df command.
We will be focusing on adding data to these files, but you can also use the oxen remote df command to view the contents of a Data Frame with the same parameters you would use locally. To learn more about the oxen remote df command, check out the data frame documentation.
oxen remote df annotations/train.csv
Full shape: (9000, 6)
Slice shape: (10, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000201969.jpg ┆ dog   ┆ 167.24 ┆ 73.99  ┆ 37.0   ┆ 64.94  |
| images/000000201969.jpg ┆ dog   ┆ 110.81 ┆ 83.87  ┆ 18.02  ┆ 38.95  |
| images/000000201969.jpg ┆ dog   ┆ 157.04 ┆ 133.63 ┆ 38.63  ┆ 18.55  |
| images/000000201969.jpg ┆ dog   ┆ 97.72  ┆ 110.2  ┆ 35.9   ┆ 71.11  |
+-------------------------+-------+--------+--------+--------+--------+
Say you want to add a bounding box annotation to this data frame without cloning it locally. You can use the --add-row flag on the oxen remote df command to remotely stage a row on the Data Frame.
oxen remote df annotations/train.csv --add-row "my-images/image.jpg,dog,100,100,200,200"
shape: (1, 7)
+----------------------------------+---------------------+-------+-------+-------+-------+--------+
| _id                              ┆ file                ┆ label ┆ min_x ┆ min_y ┆ width ┆ height |
| ---                              ┆ ---                 ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    |
| str                              ┆ str                 ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    |
+----------------------------------+---------------------+-------+-------+-------+-------+--------+
| d2b9dc18c7e50eebd0541e98f8b83efa ┆ my-images/image.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  |
+----------------------------------+---------------------+-------+-------+-------+-------+--------+
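When rows come from a script rather than being typed by hand, it can help to generate the comma-separated --add-row value programmatically. The sketch below is one way to do that in Python; format_row and add_row_command are hypothetical helper names. It uses the standard csv module so that fields containing commas are quoted, but whether Oxen accepts quoted fields in --add-row is an assumption you should verify.

```python
import csv
import io


def format_row(values):
    """Serialize one row as a single CSV line, without the trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue().rstrip("\r\n")


def add_row_command(path, values):
    """Build the argv for staging one row on a remote data frame."""
    # Assumption: the row value follows the column order of the data frame.
    return ["oxen", "remote", "df", path, "--add-row", format_row(values)]


row = ["my-images/image.jpg", "dog", 100, 100, 200, 200]
print(add_row_command("annotations/train.csv", row))
```

Passing the argv as a list (for example to subprocess.run) avoids shell quoting issues with the row string.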
This returns a unique ID for the row that we can use as a handle to interact with the specific row in the remote workspace. To list the added rows on the data frame you can use the oxen remote diff command.
oxen remote diff annotations/train.csv
Added Rows
shape: (2, 7)
+----------------------------------+----------------------+-------+-------+-------+-------+--------+
| _id                              ┆ file                 ┆ label ┆ min_x ┆ min_y ┆ width ┆ height |
| ---                              ┆ ---                  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    |
| str                              ┆ str                  ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    |
+----------------------------------+----------------------+-------+-------+-------+-------+--------+
| 822ac1facbd79444f1f33a2a0b2f909d ┆ my-images/image2.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  |
| ab8e28d66d21934f35efcb9af7ce866f ┆ my-images/image3.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  |
+----------------------------------+----------------------+-------+-------+-------+-------+--------+
If you want to delete a staged row, you can delete it with the --delete-row flag and the value in the _id column.
oxen remote df annotations/train.csv --delete-row 822ac1facbd79444f1f33a2a0b2f909d
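If you capture the text printed by oxen remote diff, the _id values can be pulled out with a few lines of Python and turned into --delete-row commands. This is a rough sketch that assumes the table layout shown in the sample output above (cells separated by ┆, IDs of 32 hexadecimal characters); the function name staged_row_ids is hypothetical, and parsing display output is fragile if the format changes.

```python
import string

SAMPLE_DIFF = """\
| _id ┆ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height |
| str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 |
| 822ac1facbd79444f1f33a2a0b2f909d ┆ my-images/image2.jpg ┆ dog ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0 |
| ab8e28d66d21934f35efcb9af7ce866f ┆ my-images/image3.jpg ┆ dog ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0 |
"""


def staged_row_ids(table_text):
    """Return the first-column values that look like row IDs."""
    ids = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, headings, and shape lines
        first_cell = line.strip("|").split("┆")[0].strip()
        # Assumption: IDs are 32 hexadecimal characters, as in the sample output.
        if len(first_cell) == 32 and set(first_cell) <= set(string.hexdigits):
            ids.append(first_cell)
    return ids


for row_id in staged_row_ids(SAMPLE_DIFF):
    print(f"oxen remote df annotations/train.csv --delete-row {row_id}")
```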
To clear all staged rows, you can use the restore subcommand to restore the file.
oxen remote restore --staged annotations/train.csv
Once you have a set of rows you want to commit on your data frame, simply call oxen remote commit again.
oxen remote commit -m "adding an image"