Whether your training data is stored as csv, parquet, or line-delimited json, it is useful to load it into data frames that you can filter, aggregate, slice, and dice.

To follow along with the examples below, feel free to grab the example data from our public SpamOrHam repository.

mkdir spam-or-ham
cd spam-or-ham
oxen download datasets/SpamOrHam data.tsv

Look At Your Data

Oxen comes with a convenient df subcommand to view your data frames on disk. This is useful for quickly inspecting your data before you start modifying it.

$ oxen df data.tsv

shape: (4_774, 2)
+-----------+---------------------------------+
| dcategory | text                            |
| ---       | ---                             |
| str       | str                             |
+-----------+---------------------------------+
| ham       | Go until jurong point, crazy..… |
| ham       | Ok lar... Joking wif u oni...   |
| spam      | Free entry in 2 a wkly comp to… |
| ham       | U dun say so early hor... U c … |
| ham       | Nah I dont think  he goes to u… |
| …         | …                               |
| ham       | Well, im glad you didnt find  … |
| ham       | Guy, no flash me now. If you g… |
| spam      | Do you want a New Nokia 3510i … |
| ham       | Mark works tomorrow. He gets o… |
| ham       | Keep ur problems in ur heart, … |
+-----------+---------------------------------+

Upload Your Data

Then add the data to a repository of your own so that you can modify it. You can do this through the UI, the Python library, or the CLI.

If you have pushed to the Oxen Hub, you can view, edit, and query your data directly using the UI.

Spam or Ham Data Frame

Editing Data Frames

Oxen allows you to interact with data frames that are not downloaded to your local machine. This can be useful for data collection, labeling workflows, or quickly inspecting data without having to download it.

Once you have pushed your data to an Oxen repository, Oxen exposes a CRUD interface to interact with the rows.

from oxen import DataFrame

# Connect to the data frame
df = DataFrame("my-username/spam-or-ham", "data.tsv")

# Add a row
row_id = df.insert_row({"category": "spam", "message": "CLICK HERE TO WIN INSTANTLY."})

# Get a row by id
row = df.get_row_by_id(row_id)
print(row)

# Update a row
row = df.update_row(row_id, {"category": "new_category"})
print(row)

# Delete a row
df.delete_row(row_id)

# Commit the changes
df.commit("Update label")

All of these operations are exposed over HTTP, so you are not limited to using the Python library. Check out our HTTP reference docs to see how to interact with your data programmatically.

Local Data Frames

oxen df

Oxen has a convenient df (short for “Data Frame”) command to deal with tabular data. The train.csv file shown below has 9,000 rows and 6 columns of bounding boxes around cats or dogs. The shape hint at the top of the output can be useful for making sure you are transforming the data correctly.

oxen df train.csv
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
|-------------------------+-------+--------+--------+--------+--------|
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000431980.jpg ┆ dog   ┆ 98.3   ┆ 110.46 ┆ 42.69  ┆ 26.64  |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  |
+-------------------------+-------+--------+--------+--------+--------+

Under the hood, Oxen combines polars and duckdb and uses the Apache Arrow data format to provide cross-application functionality.

Useful Commands

There are many ways you might want to view, transform, and filter your data on the command line before committing a new version of the dataset.

To quickly see all the options on the df command you can run oxen df --help.

Convert Dataset Format

The --output option is handy for quickly converting data files between formats on disk. Some formats, like parquet and arrow, are more efficient for certain data tasks but are not human readable like tsv or csv. Data format is always a trade-off you'll have to decide on for your application.

Oxen currently supports these file extensions: csv, tsv, parquet, arrow, json, jsonl.

oxen df train.csv -o train.parquet
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
|-------------------------+-------+--------+--------+--------+--------|
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000431980.jpg ┆ dog   ┆ 98.3   ┆ 110.46 ┆ 42.69  ┆ 26.64  |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  |
+-------------------------+-------+--------+--------+--------+--------+

Writing "train.parquet"
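If you need a similar conversion inside a script, the idea can be sketched with only the Python standard library. This is an illustration, not how Oxen implements --output: the csv_to_jsonl helper is hypothetical, the two sample rows are copied from the preview above, and every value stays a string because the csv module does no type inference.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text into line-delimited JSON, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


# Two rows copied from the train.csv preview
sample = (
    "file,label,min_x,min_y,width,height\n"
    "images/000000128154.jpg,cat,0.0,19.27,130.79,129.58\n"
    "images/000000000581.jpg,dog,49.37,67.79,74.29,116.08\n"
)
jsonl = csv_to_jsonl(sample)
print(jsonl)
```

Columnar formats such as parquet need a dedicated library, which is why the sketch stops at jsonl.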

SQL Query

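Oxen has a SQL query engine built into the CLI. You can run SQL queries on your data frames with the --sql flag; as in the example below, the data frame is referenced as the table df. String literals take single quotes, since double quotes denote identifiers in standard SQL.

oxen df train.csv --sql "SELECT * FROM df WHERE label = 'dog'"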
shape: (4_860, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ dog   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ dog   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ dog   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000055645.jpg ┆ dog   ┆ 8.67   ┆ 122.36 ┆ 60.22  ┆ 99.24  |
| images/000000094271.jpg ┆ dog   ┆ 47.6   ┆ 115.26 ┆ 111.57 ┆ 102.27 |
| images/000000041257.jpg ┆ dog   ┆ 6.81   ┆ 117.29 ┆ 207.06 ┆ 86.08  |
| images/000000321014.jpg ┆ dog   ┆ 51.86  ┆ 61.18  ┆ 166.26 ┆ 63.11  |
+-------------------------+-------+--------+--------+--------+--------+
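To experiment with a query like this on a tiny table, Python's built-in sqlite3 module is enough. The sketch below only illustrates the SQL itself, not Oxen's query engine. The table is named df to mirror the CLI example, and the three rows are copied from the train.csv previews earlier on this page.

```python
import sqlite3

rows = [
    ("images/000000128154.jpg", "cat", 0.0, 19.27, 130.79, 129.58),
    ("images/000000000581.jpg", "dog", 49.37, 67.79, 74.29, 116.08),
    ("images/000000431980.jpg", "dog", 98.3, 110.46, 42.69, 26.64),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df "
    "(file TEXT, label TEXT, min_x REAL, min_y REAL, width REAL, height REAL)"
)
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?, ?, ?)", rows)

# String literals use single quotes; double quotes would name an identifier
dogs = conn.execute("SELECT * FROM df WHERE label = 'dog'").fetchall()
for row in dogs:
    print(row)
```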

Text2SQL

If you would rather not write SQL by hand, the CLI also has a text2sql engine, available through the --text2sql flag. It uses an LLM to convert a natural language request into a SQL query, which is useful for quickly querying data frames without having to remember SQL syntax.

oxen df train.csv --text2sql 'show me all the rows where the label is dog'
shape: (4_860, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ dog   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ dog   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ dog   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000055645.jpg ┆ dog   ┆ 8.67   ┆ 122.36 ┆ 60.22  ┆ 99.24  |
| images/000000094271.jpg ┆ dog   ┆ 47.6   ┆ 115.26 ┆ 111.57 ┆ 102.27 |
| images/000000041257.jpg ┆ dog   ┆ 6.81   ┆ 117.29 ┆ 207.06 ┆ 86.08  |
| images/000000321014.jpg ┆ dog   ┆ 51.86  ┆ 61.18  ┆ 166.26 ┆ 63.11  |
+-------------------------+-------+--------+--------+--------+--------+

NOTE: The text2sql engine is still in development and may not work for all queries. It also requires you to have an Oxen.ai API key set up.

Randomize

Often you will want to randomize data before splitting into train and test sets, or even just to peek at different data values.

oxen df train.csv --randomize
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000124002.jpg ┆ cat   ┆ 82.92  ┆ 8.31   ┆ 108.31 ┆ 158.48 |
| images/000000207597.jpg ┆ dog   ┆ 75.64  ┆ 3.65   ┆ 125.47 ┆ 218.19 |
| images/000000113810.jpg ┆ cat   ┆ 104.34 ┆ 44.65  ┆ 119.66 ┆ 159.42 |
| images/000000340160.jpg ┆ dog   ┆ 79.78  ┆ 89.31  ┆ 127.1  ┆ 103.66 |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000310573.jpg ┆ dog   ┆ 102.55 ┆ 91.48  ┆ 42.24  ┆ 52.18  |
| images/000000162801.jpg ┆ cat   ┆ 112.96 ┆ 75.05  ┆ 57.38  ┆ 98.19  |
| images/000000544117.jpg ┆ dog   ┆ 108.16 ┆ 124.28 ┆ 11.08  ┆ 64.58  |
| images/000000283210.jpg ┆ dog   ┆ 49.37  ┆ 40.01  ┆ 174.43 ┆ 182.0  |
+-------------------------+-------+--------+--------+--------+--------+
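The shuffle-then-split workflow mentioned above can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than anything Oxen does internally: the shuffle_and_split helper, the 10% test fraction, the fixed seed, and the generated file names are all hypothetical.

```python
import random


def shuffle_and_split(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of rows with a fixed seed and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the rows of a data frame
rows = [f"images/{i:012d}.jpg" for i in range(100)]
train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the same rows land in the test set on every run.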

View Schema

Oxen automatically detects and versions the schema of your data frame. See the schema docs for more information on the power of Oxen schemas.

To view a data frame's full schema, use the --schema flag.

oxen df train.csv --schema
+--------+-------+
| column | dtype |
+----------------+
| file   | str   |
|--------+-------|
| label  | str   |
|--------+-------|
| min_x  | f64   |
|--------+-------|
| min_y  | f64   |
|--------+-------|
| width  | f64   |
|--------+-------|
| height | f64   |
+--------+-------+
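To get a feel for what schema detection involves, here is a deliberately naive sketch in plain Python. It is not Oxen's detection logic: the infer_schema and infer_dtype helpers are hypothetical, and the rule (a column is f64 if every value parses as a float, otherwise str) is an assumption chosen to reproduce the dtypes shown above for two sample rows.

```python
import csv
import io


def infer_dtype(values):
    """Return 'f64' if every value parses as a float, otherwise 'str'."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "str"
    return "f64"


def infer_schema(csv_text):
    """Map each column name to a guessed dtype."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_dtype([row[name] for row in rows])
        for name in reader.fieldnames
    }


sample = (
    "file,label,min_x,min_y,width,height\n"
    "images/000000128154.jpg,cat,0.0,19.27,130.79,129.58\n"
    "images/000000000581.jpg,dog,49.37,67.79,74.29,116.08\n"
)
schema = infer_schema(sample)
print(schema)
```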

View Specific Columns

Maybe you have many columns and only need to work with a few. You can specify column names in a comma-separated list with --columns.

oxen df train.csv --columns 'file,label'
shape: (9_000, 2)
+-------------------------+-------+
| file                    ┆ label |
| ---                     ┆ ---   |
| str                     ┆ str   |
+-------------------------+-------+
| images/000000128154.jpg ┆ cat   |
| images/000000544590.jpg ┆ cat   |
| images/000000000581.jpg ┆ dog   |
| images/000000236841.jpg ┆ cat   |
| …                       ┆ …     |
| images/000000431980.jpg ┆ dog   |
| images/000000071025.jpg ┆ cat   |
| images/000000518015.jpg ┆ cat   |
| images/000000171435.jpg ┆ dog   |
+-------------------------+-------+
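The equivalent projection in a script is a one-liner over row dictionaries. The sketch below is plain Python rather than Oxen; select_columns is a hypothetical helper and the two rows are copied from the preview above, truncated to four columns.

```python
def select_columns(rows, spec):
    """Keep only the columns named in a comma-separated spec such as 'file,label'."""
    names = [name.strip() for name in spec.split(",")]
    return [{name: row[name] for name in names} for row in rows]


rows = [
    {"file": "images/000000128154.jpg", "label": "cat", "min_x": 0.0, "min_y": 19.27},
    {"file": "images/000000000581.jpg", "label": "dog", "min_x": 49.37, "min_y": 67.79},
]
subset = select_columns(rows, "file,label")
print(subset)
```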

Concatenate (vstack)

Maybe you have filtered down data, and want to stack the data back into a single frame. The --vstack option takes a variable length list of files you would like to concatenate.

oxen df train.csv --filter 'label == dog' -o /tmp/dogs.parquet
oxen df train.csv --filter 'label == cat' -o /tmp/cats.parquet
oxen df /tmp/cats.parquet --vstack /tmp/dogs.parquet -o annotations/data.parquet
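Conceptually, a vertical stack appends the rows of one frame to another, and it only makes sense when the columns line up. The sketch below shows that idea in plain Python. The vstack helper is hypothetical and works on (columns, rows) pairs; raising an error on a schema mismatch is an assumption, not documented Oxen behavior.

```python
def vstack(first, *others):
    """Stack frames vertically; each frame is a (columns, rows) pair."""
    columns, rows = first
    stacked = list(rows)
    for other_columns, other_rows in others:
        if other_columns != columns:
            raise ValueError(f"schema mismatch: {other_columns} != {columns}")
        stacked.extend(other_rows)
    return columns, stacked


cats = (["file", "label"], [("images/000000128154.jpg", "cat")])
dogs = (["file", "label"], [("images/000000000581.jpg", "dog")])
columns, rows = vstack(cats, dogs)
print(columns, len(rows))
```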

Take Indices

Sometimes you have a specific row or set of rows of data you would like to look at. This is where the --take option comes in handy.

oxen df train.csv --take '1,13,42'
shape: (3, 6)
+-------------------------+-------+-------+-------+--------+--------+
| file                    ┆ label ┆ min_x ┆ min_y ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64   ┆ f64   ┆ f64    ┆ f64    |
+-------------------------+-------+-------+-------+--------+--------+
| images/000000544590.jpg ┆ cat   ┆ 9.75  ┆ 13.49 ┆ 214.25 ┆ 188.35 |
| images/000000279829.jpg ┆ cat   ┆ 30.01 ┆ 13.58 ┆ 82.51  ┆ 176.39 |
| images/000000209289.jpg ┆ dog   ┆ 72.75 ┆ 42.06 ┆ 111.52 ┆ 153.09 |
+-------------------------+-------+-------+-------+--------+--------+
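Judging by the output above, where index 1 returns the second row of the earlier train.csv preview, the indices appear to be zero-based. The same selection can be sketched in plain Python. The take and parse_indices helpers are hypothetical, and the rows are placeholder strings rather than real data.

```python
def parse_indices(spec):
    """Parse a comma-separated string such as '1,13,42' into integers."""
    return [int(part) for part in spec.split(",")]


def take(rows, indices):
    """Return the rows at the given zero-based positions, in the order requested."""
    return [rows[i] for i in indices]


# Placeholder rows standing in for a data frame
rows = [f"row-{i}" for i in range(50)]
selected = take(rows, parse_indices("1,13,42"))
print(selected)
```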

Add Column

Your data might not match the schema of a data frame you want to combine it with. In this case, you may need to add a column to match the schema. You can do this and project default values with --add-col 'col:val:dtype'.

oxen df train.csv --add-col 'is_cute:unknown:str'
shape: (9_000, 7)
+-------------------------+-------+--------+--------+--------+--------+---------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height ┆ is_cute |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     |
+-------------------------+-------+--------+--------+--------+--------+---------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 ┆ unknown |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 ┆ unknown |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 ┆ unknown |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  ┆ unknown |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      ┆ …       |
| images/000000431980.jpg ┆ dog   ┆ 98.3   ┆ 110.46 ┆ 42.69  ┆ 26.64  ┆ unknown |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  ┆ unknown |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  ┆ unknown |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  ┆ unknown |
+-------------------------+-------+--------+--------+--------+--------+---------+
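The 'col:val:dtype' spec is easy to reason about once it is split into its three parts. The sketch below is a plain Python illustration, not Oxen's parser. The add_col helper and the CASTS table are hypothetical, and only the two dtypes that appear on this page (str and f64) are handled.

```python
# Only the dtypes that appear on this page; anything else raises KeyError
CASTS = {"str": str, "f64": float}


def add_col(rows, spec):
    """Add a column described by 'name:default:dtype' to a copy of every row."""
    name, default, dtype = spec.split(":")
    value = CASTS[dtype](default)
    return [{**row, name: value} for row in rows]


rows = [{"file": "images/000000128154.jpg", "label": "cat"}]
with_col = add_col(rows, "is_cute:unknown:str")
print(with_col)
```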

Add Row

Sometimes it can be a pain to append data to a data file without writing code to do so. The --add-row option makes it as easy as a comma separated list and automatically parses the data to the correct dtypes.

oxen df train.csv --add-row 'images/my_cat.jpg,cat,0,0,0,0'
shape: (9_001, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  |
| images/my_cat.jpg       ┆ cat   ┆ 0.0    ┆ 0.0    ┆ 0.0    ┆ 0.0    |
+-------------------------+-------+--------+--------+--------+--------+
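The dtype parsing is why the zeros typed as 0 show up as 0.0 in the f64 columns above. A minimal sketch of that behavior in plain Python follows. The parse_row helper is hypothetical and the schema list simply mirrors the column names and dtypes shown in the output; it is not Oxen's implementation.

```python
def parse_row(spec, schema):
    """Split a comma-separated row and cast each value to its column's type."""
    values = spec.split(",")
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} values, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


# Column names and types mirror the train.csv schema (str and f64)
schema = [
    ("file", str),
    ("label", str),
    ("min_x", float),
    ("min_y", float),
    ("width", float),
    ("height", float),
]
row = parse_row("images/my_cat.jpg,cat,0,0,0,0", schema)
print(row)
```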

Unique

Oxen can efficiently compute all the unique values given a column name, or comma separated list of column names.

oxen df train.csv --unique "file"
oxen df train.csv -u "file,label"
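The difference between a single column and a list of columns is which combinations count as duplicates. The sketch below illustrates this in plain Python. The unique_values helper is hypothetical, the three rows are made up for the example, and whether Oxen returns whole rows or only the key columns is not stated here, so the sketch returns just the distinct key tuples.

```python
def unique_values(rows, columns):
    """Return distinct value combinations for the given columns, in first-seen order."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[name] for name in columns)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Made-up rows: the same file annotated three times
rows = [
    {"file": "images/000000128154.jpg", "label": "cat"},
    {"file": "images/000000128154.jpg", "label": "cat"},
    {"file": "images/000000128154.jpg", "label": "dog"},
]
by_file = unique_values(rows, ["file"])
by_file_and_label = unique_values(rows, ["file", "label"])
print(len(by_file), len(by_file_and_label))
```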

Sort

Sorting can be achieved with the --sort flag. For example, you may want to find the largest bounding boxes by sorting on the height column.

oxen df train.csv --sort "height"
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000580919.jpg ┆ dog   ┆ 61.28  ┆ 88.31  ┆ 2.71   ┆ 1.83   |
| images/000000577310.jpg ┆ dog   ┆ 132.25 ┆ 193.86 ┆ 3.28   ┆ 1.95   |
| images/000000393384.jpg ┆ dog   ┆ 138.85 ┆ 89.89  ┆ 1.25   ┆ 2.11   |
| images/000000477398.jpg ┆ dog   ┆ 185.11 ┆ 195.93 ┆ 2.51   ┆ 2.6    |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000069205.jpg ┆ dog   ┆ 0.0    ┆ 0.0    ┆ 224.0  ┆ 224.0  |
| images/000000554737.jpg ┆ cat   ┆ 0.0    ┆ 0.0    ┆ 224.0  ┆ 224.0  |
| images/000000213819.jpg ┆ cat   ┆ 8.32   ┆ 0.0    ┆ 207.77 ┆ 224.0  |
| images/000000397212.jpg ┆ cat   ┆ 0.36   ┆ 0.0    ┆ 115.5  ┆ 224.0  |
+-------------------------+-------+--------+--------+--------+--------+

Reverse

You can also reverse the order of a data table. By default --sort sorts in ascending order, but can be reversed with the --reverse flag.

oxen df train.csv --reverse
shape: (7_128, 2)
+-------------------------+----------------+
| file                    ┆ count('label') |
| ---                     ┆ ---            |
| str                     ┆ u32            |
+-------------------------+----------------+
| images/000000315555.jpg ┆ 19             |
| images/000000016950.jpg ┆ 19             |
| images/000000244933.jpg ┆ 17             |
| images/000000113762.jpg ┆ 14             |
| …                       ┆ …              |
| images/000000026942.jpg ┆ 1              |
| images/000000491845.jpg ┆ 1              |
| images/000000536154.jpg ┆ 1              |
| images/000000559557.jpg ┆ 1              |
+-------------------------+----------------+
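In a script, sorting and reversing compose the same way. The sketch below uses plain Python lists rather than Oxen, with three file and height pairs copied from the tables on this page. It sorts ascending on height and then reverses the result, which puts the tallest bounding box first.

```python
rows = [
    {"file": "images/000000580919.jpg", "height": 1.83},
    {"file": "images/000000069205.jpg", "height": 224.0},
    {"file": "images/000000000581.jpg", "height": 116.08},
]

# Ascending sort on height, then reverse so the tallest boxes come first
ascending = sorted(rows, key=lambda row: row["height"])
descending = list(reversed(ascending))
print([row["height"] for row in descending])
```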