Data frames are 2-dimensional tables that organize data into rows and columns, like Excel spreadsheets. Whether you’re working with csv, parquet, or line-delimited json, you can use them to view, edit, query, and filter your data.

Below are several examples of data frame usage with our public SpamOrHam repository, which compares spam (spam) text messages to real (ham) ones. You can use these commands to download the data and follow along.

mkdir spam-or-ham
cd spam-or-ham
oxen download datasets/SpamOrHam data.tsv

Look At Your Data

oxen df

Oxen uses the df command for all CLI actions involving data frames. For example, oxen df <FILENAME> displays the contents of tabular data files.

$ oxen df data.tsv

shape: (4_774, 2)
+----------+---------------------------------+
| category | text                            |
| ---      | ---                             |
| str      | str                             |
+----------+---------------------------------+
| ham      | Go until jurong point, crazy..… |
| ham      | Ok lar... Joking wif u oni...   |
| spam     | Free entry in 2 a wkly comp to… |
| ham      | U dun say so early hor... U c … |
| ham      | Nah I don't think he goes to u… |
| …        | …                               |
| ham      | Well, i'm glad you didn't find… |
| ham      | Guy, no flash me now. If you g… |
| spam     | Do you want a New Nokia 3510i … |
| ham      | Mark works tomorrow. He gets o… |
| ham      | Keep ur problems in ur heart, … |
+----------+---------------------------------+

Here, we see that SpamOrHam’s dataset consists of 4,774 rows and 2 columns. The output is automatically truncated to 10 entries. To display the entire data set, you can use the --full flag.

You can also use oxen df options to view your data with modifications. These changes won’t be written anywhere unless you use the --write or --output flags.

# Add extra column
$ oxen df data.tsv --add-col 'language:English:str'

shape: (4_774, 3)
+----------+---------------------------------+----------+
| category | text                            | language |
| ---      | ---                             | ---      |
| str      | str                             | str      |
+----------+---------------------------------+----------+
| ham      | Go until jurong point, crazy..… | English  |
| ham      | Ok lar... Joking wif u oni...   | English  |
| spam     | Free entry in 2 a wkly comp to… | English  |
| ham      | U dun say so early hor... U c … | English  |
| ham      | Nah I don't think he goes to u… | English  |
| …        | …                               | …        |
| ham      | Well, i'm glad you didn't find… | English  |
| ham      | Guy, no flash me now. If you g… | English  |
| spam     | Do you want a New Nokia 3510i … | English  |
| ham      | Mark works tomorrow. He gets o… | English  |
| ham      | Keep ur problems in ur heart, … | English  |
+----------+---------------------------------+----------+

# Filter out spam messages, view text only
$ oxen df data.tsv --filter 'category == ham' --columns 'text'

shape: (4_124, 1)
+---------------------------------+
| text                            |
| ---                             |
| str                             |
+---------------------------------+
| Go until jurong point, crazy..… |
| Ok lar... Joking wif u oni...   |
| U dun say so early hor... U c … |
| Nah I don't think he goes to u… |
| Even my brother is not like to… |
| …                               |
| I want to sent  &lt;#&gt; mesa… |
| Well, i'm glad you didn't find… |
| Guy, no flash me now. If you g… |
| Mark works tomorrow. He gets o… |
| Keep ur problems in ur heart, … |
+---------------------------------+

# Randomize the data, then view the first 5 entries
$ oxen df data.tsv --head 5 --randomize

shape: (5, 2)
+----------+---------------------------------+
| category | text                            |
| ---      | ---                             |
| str      | str                             |
+----------+---------------------------------+
| ham      | He didn't see his shadow. We g… |
| ham      | Thank god they are in bed!      |
| ham      | Where are you ? You said you w… |
| spam     | XCLUSIVE@CLUBSAISAI 2MOROW 28/… |
| ham      | In which place do you want da.  |
+----------+---------------------------------+

Uploading Data

Before modifying your data, add it to a repository to preserve its history. This can be done in the UI, Python, or CLI.

If you’ve pushed to the Oxen Hub, you can view, edit, and query your data directly using the UI.

Editing Data Frames

Once you’ve added your data to an Oxen repository, you can interact with data frames even if they’re not downloaded locally. Oxen exposes a CRUD interface that makes this possible.

from oxen import DataFrame

# Connect to the data frame
df = DataFrame("my-username/spam-or-ham", "data.tsv")

# Add a row
row_id = df.insert_row({"category": "spam", "text": "CLICK HERE TO WIN INSTANTLY."})

# Get a row by id
row = df.get_row_by_id(row_id)
print(row)

# Update a row
row = df.update_row(row_id, {"category": "new_category"})
print(row)

# Delete a row
df.delete_row(row_id)

# Commit the changes
df.commit("Update label")

All of these operations are exposed over HTTP, so you are not limited to using the Python library. Check out our HTTP reference docs to see how to interact with your data programmatically.

You can also edit data files locally with oxen df --write. Any modifications you make with this flag set will be written back to the original file and register as ‘modified’ in your Oxen repository.

$ oxen df data.tsv --filter 'category == spam' --write

shape: (650, 2)
+----------+---------------------------------+
| category | text                            |
| ---      | ---                             |
| str      | str                             |
+----------+---------------------------------+
| spam     | Free entry in 2 a wkly comp to… |
| spam     | FreeMsg Hey there darling it's… |
| spam     | WINNER!! As a valued network c… |
| spam     | Had your mobile 11 months or m… |
| spam     | SIX chances to win CASH! From … |
| …        | …                               |
| spam     | 83039 62735=£450 UK Break Acco… |
| spam     | 5p 4 alfie Moon's Children in … |
| spam     | WIN a £200 Shopping spree ever… |
| spam     | This is the 2nd attempt to con… |
| spam     | Do you want a New Nokia 3510i … |
+----------+---------------------------------+
Writing "data.tsv"

$ oxen df data.tsv 

shape: (650, 2)
+----------+---------------------------------+
| category | text                            |
| ---      | ---                             |
| str      | str                             |
+----------+---------------------------------+
| spam     | Free entry in 2 a wkly comp to… |
| spam     | FreeMsg Hey there darling it's… |
| spam     | WINNER!! As a valued network c… |
| spam     | Had your mobile 11 months or m… |
| spam     | SIX chances to win CASH! From … |
| …        | …                               |
| spam     | 83039 62735=£450 UK Break Acco… |
| spam     | 5p 4 alfie Moon's Children in … |
| spam     | WIN a £200 Shopping spree ever… |
| spam     | This is the 2nd attempt to con… |
| spam     | Do you want a New Nokia 3510i … |
+----------+---------------------------------+

Oxen uses a combination of polars and duckdb under the hood, and uses the Apache Arrow data format to provide cross-application functionality.

Useful Commands

There are many ways you might want to view, transform, and filter your data on the command line before committing changes to the dataset. oxen df provides several options that can help with this.

For these examples, we’ll use our CatDogBBox repository.

Convert Dataset Format

Oxen allows you to quickly transform data files between data formats. When you run oxen df with --output, the resulting data frame will be written to disk as a new file of the specified type.

Binary formats such as parquet and arrow are more efficient for many tasks, but unlike tsv or csv they are not human readable. Which tradeoff is right depends on your application. Oxen currently supports the following file extensions: csv, tsv, parquet, arrow, json, jsonl.

oxen df train.csv -o train.parquet
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000431980.jpg ┆ dog   ┆ 98.3   ┆ 110.46 ┆ 42.69  ┆ 26.64  |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  |
+-------------------------+-------+--------+--------+--------+--------+

Writing "train.parquet"
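
To make the idea of format conversion concrete, here is a minimal Python sketch that rewrites tab-separated text as line-delimited JSON using only the standard library. It illustrates the concept and is not how Oxen implements --output; the helper name tsv_to_jsonl and the two sample rows are assumptions made for this example.

```python
import csv
import io
import json


def tsv_to_jsonl(tsv_text: str) -> str:
    """Convert tab-separated text with a header row to line-delimited JSON."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # One JSON object per input row; every value stays a string in this sketch.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical two-row sample in the same shape as data.tsv.
sample = "category\ttext\nham\tOk lar...\nspam\tFree entry\n"
jsonl = tsv_to_jsonl(sample)
print(jsonl)
```

The row-oriented text output stays readable in any editor, which is the tradeoff described above when compared with columnar binary formats.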

SQL Query

Oxen has a SQL query engine built into the CLI. You can run SQL queries on your data frames with the --sql flag.

oxen df train.csv --sql 'SELECT * FROM df WHERE label = "dog"'

shape: (4_860, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ dog   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ dog   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ dog   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000055645.jpg ┆ dog   ┆ 8.67   ┆ 122.36 ┆ 60.22  ┆ 99.24  |
| images/000000094271.jpg ┆ dog   ┆ 47.61  ┆ 15.26  ┆ 111.57 ┆ 102.27 |
| images/000000041257.jpg ┆ dog   ┆ 6.81   ┆ 117.29 ┆ 207.06 ┆ 86.08  |
| images/000000321014.jpg ┆ dog   ┆ 51.86  ┆ 61.18  ┆ 166.26 ┆ 63.11  |
+-------------------------+-------+--------+--------+--------+--------+
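
If you want to experiment with this kind of query without Oxen installed, the sketch below runs a similar SELECT against an in-memory table. It uses Python's built-in sqlite3 module purely as a stand-in, so dialect details may differ from Oxen's engine. The table name df mirrors the CLI example, and the four rows are copied from the train.csv output shown in this document.

```python
import sqlite3

# A few rows copied from the train.csv examples in this document.
rows = [
    ("images/000000128154.jpg", "cat", 0.0, 19.27, 130.79, 129.58),
    ("images/000000544590.jpg", "cat", 9.75, 13.49, 214.25, 188.35),
    ("images/000000177913.jpg", "dog", 11.56, 52.83, 177.18, 166.41),
    ("images/000000171435.jpg", "dog", 22.86, 100.03, 125.55, 41.61),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df "
    "(file TEXT, label TEXT, min_x REAL, min_y REAL, width REAL, height REAL)"
)
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same shape of query as the --sql example, using a single-quoted string literal.
dogs = conn.execute("SELECT file, width FROM df WHERE label = 'dog'").fetchall()
print(dogs)
conn.close()
```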

Filter

If you don't need a full SQL query, Oxen also has a lightweight --filter option, which supports the >, <, and == operators.

oxen df train.csv --filter 'width > 100'

shape: (3_483, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    | label | min_x  | min_y  | width  | height |
| ---                     | ---   | ---    | ---    | ---    | ---    |
| str                     | str   | f64    | f64    | f64    | f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | cat   | 0.0    | 19.27  | 130.79 | 129.58 |
| images/000000544590.jpg | cat   | 9.75   | 13.49  | 214.25 | 188.35 |
| images/000000177913.jpg | dog   | 11.56  | 52.83  | 177.18 | 166.41 |
| images/000000002337.jpg | dog   | 5.42   | 7.28   | 180.01 | 167.21 |
| images/000000012673.jpg | cat   | 117.11 | 98.61  | 106.61 | 47.17  |
| …                       | …     | …      | …      | …      | …      |
| images/000000399102.jpg | cat   | 106.03 | 145.22 | 111.91 | 66.68  |
| images/000000155707.jpg | cat   | 14.18  | 13.97  | 165.84 | 207.62 |
| images/000000150919.jpg | cat   | 38.7   | 71.29  | 147.7  | 127.17 |
| images/000000071025.jpg | cat   | 55.33  | 105.45 | 160.15 | 73.57  |
| images/000000171435.jpg | dog   | 22.86  | 100.03 | 125.55 | 41.61  |
+-------------------------+-------+--------+--------+--------+--------+
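
As a rough mental model of what a 'column op value' filter does, here is a small Python sketch. It is a guess at the semantics for illustration only, not Oxen's parser; the helper apply_filter is hypothetical, and the sample widths are taken from rows shown in this document.

```python
import operator

# The three comparison operators the --filter option is documented to support.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def apply_filter(rows, expr):
    """Keep rows matching a 'column op value' expression such as 'width > 100'."""
    column, op, raw = expr.split(maxsplit=2)
    compare = OPS[op]
    kept = []
    for row in rows:
        cell = row[column]
        # Compare numerically when the cell is a number, otherwise as text.
        value = float(raw) if isinstance(cell, (int, float)) else raw
        if compare(cell, value):
            kept.append(row)
    return kept


rows = [
    {"file": "images/000000128154.jpg", "label": "cat", "width": 130.79},
    {"file": "images/000000012673.jpg", "label": "cat", "width": 106.61},
    {"file": "images/000000580919.jpg", "label": "dog", "width": 2.71},
]
print(apply_filter(rows, "width > 100"))
print(apply_filter(rows, "label == dog"))
```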

View Schema

Oxen automatically detects and versions the schema of your data frame. See the schema docs (/concepts/schemas) for more information.

To view a data frame's schema in full, you can use the --schema flag.

oxen df train.csv --schema
+--------+-------+
| column | dtype |
+--------+-------+
| file   | str   |
|--------+-------|
| label  | str   |
|--------+-------|
| min_x  | f64   |
|--------+-------|
| min_y  | f64   |
|--------+-------|
| width  | f64   |
|--------+-------|
| height | f64   |
+--------+-------+
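
Schema detection can be pictured as trying progressively looser types on each column. The sketch below is one simple way to do that in plain Python and is not Oxen's actual detection logic; the helpers infer_dtype and infer_schema are hypothetical, and the dtype names just follow the ones printed above.

```python
import csv
import io


def infer_dtype(values):
    """Return 'i64', 'f64', or 'str' for a column of raw text values."""
    for dtype, cast in (("i64", int), ("f64", float)):
        try:
            for value in values:
                cast(value)
            return dtype
        except ValueError:
            continue
    return "str"


def infer_schema(csv_text):
    """Map each column name in a CSV (with header row) to an inferred dtype."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys()
    return {name: infer_dtype([row[name] for row in rows]) for name in columns}


# Two rows copied from the train.csv output shown in this document.
sample = (
    "file,label,min_x,min_y,width,height\n"
    "images/000000128154.jpg,cat,0.0,19.27,130.79,129.58\n"
    "images/000000544590.jpg,cat,9.75,13.49,214.25,188.35\n"
)
schema = infer_schema(sample)
print(schema)
```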

View Specific Columns

If you only need a subset of your data frame’s columns, you can specify them in a comma separated list with --columns.

oxen df train.csv --columns 'file,label'
shape: (9_000, 2)
+-------------------------+-------+
| file                    ┆ label |
| ---                     ┆ ---   |
| str                     ┆ str   |
+-------------------------+-------+
| images/000000128154.jpg ┆ cat   |
| images/000000544590.jpg ┆ cat   |
| images/000000000581.jpg ┆ dog   |
| images/000000236841.jpg ┆ cat   |
| …                       ┆ …     |
| images/000000431980.jpg ┆ dog   |
| images/000000071025.jpg ┆ cat   |
| images/000000518015.jpg ┆ cat   |
| images/000000171435.jpg ┆ dog   |
+-------------------------+-------+

Take Indices

You can also view particular rows by index using --take. Indices are zero-based, so index 1 selects the second row of the file.

oxen df train.csv --take '1,13,42'
shape: (3, 6)
+-------------------------+-------+-------+-------+--------+--------+
| file                    ┆ label ┆ min_x ┆ min_y ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64   ┆ f64   ┆ f64    ┆ f64    |
+-------------------------+-------+-------+-------+--------+--------+
| images/000000544590.jpg ┆ cat   ┆ 9.75  ┆ 13.49 ┆ 214.25 ┆ 188.35 |
| images/000000279829.jpg ┆ cat   ┆ 30.01 ┆ 13.58 ┆ 82.51  ┆ 176.39 |
| images/000000209289.jpg ┆ dog   ┆ 72.75 ┆ 42.06 ┆ 111.52 ┆ 153.09 |
+-------------------------+-------+-------+-------+--------+--------+

Unique

Oxen can efficiently compute all the unique values of a given column or set of columns using the --unique option.

oxen df train.csv --unique "file"
oxen df train.csv -u "file,label"
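
To show what "unique over a set of columns" means, here is a short Python sketch that collects each distinct combination of the chosen columns. It is an illustration under assumptions, not Oxen's implementation: the helper unique_values is hypothetical, and the last sample row is a made-up duplicate.

```python
def unique_values(rows, columns):
    """Return each distinct combination of the given columns, in first-seen order."""
    seen = {}
    for row in rows:
        key = tuple(row[name] for name in columns)
        seen.setdefault(key, None)  # dicts preserve insertion order
    return list(seen)


rows = [
    {"file": "images/000000128154.jpg", "label": "cat"},
    {"file": "images/000000544590.jpg", "label": "cat"},
    {"file": "images/000000000581.jpg", "label": "dog"},
    {"file": "images/000000128154.jpg", "label": "cat"},  # hypothetical duplicate
]
labels = unique_values(rows, ["label"])
pairs = unique_values(rows, ["file", "label"])
print(labels)
print(len(pairs))
```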

Concatenate (vstack)

If you’ve filtered down your data and want to stack it back into a single frame, use --vstack. The option takes a variable-length list of files you’d like to concatenate.

oxen df train.csv --filter 'label == dog' -o /tmp/dogs.parquet
oxen df train.csv --filter 'label == cat' -o /tmp/cats.parquet
oxen df /tmp/cats.parquet --vstack /tmp/dogs.parquet -o annotations/data.parquet
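
Vertical stacking only makes sense when every frame has the same columns. The Python sketch below shows that idea with lists of dict rows; it is a simplified illustration rather than Oxen's behavior, and the helper name vstack plus the choice to raise on a mismatch are assumptions.

```python
def vstack(*frames):
    """Concatenate frames (lists of dict rows) after checking that columns match."""
    expected = None
    stacked = []
    for frame in frames:
        for row in frame:
            columns = list(row)
            if expected is None:
                expected = columns  # the first row seen defines the schema
            elif columns != expected:
                raise ValueError(f"schema mismatch: {columns} != {expected}")
            stacked.append(row)
    return stacked


cats = [
    {"file": "images/000000128154.jpg", "label": "cat"},
    {"file": "images/000000544590.jpg", "label": "cat"},
]
dogs = [{"file": "images/000000000581.jpg", "label": "dog"}]
combined = vstack(cats, dogs)
print(len(combined))
```

When the columns differ, adding the missing column first (see Add Column) is one way to make the frames line up.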

Add Column

Your data might not match the schema of a data frame you want to combine with, in which case you may need to add a column to match it. You can add one, filled with a default value in every row, with --add-col 'col:val:dtype'.

oxen df train.csv --add-col 'is_cute:unknown:str'
shape: (9_000, 7)
+-------------------------+-------+--------+--------+--------+--------+---------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height ┆ is_cute |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     |
+-------------------------+-------+--------+--------+--------+--------+---------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 ┆ unknown |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 ┆ unknown |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 ┆ unknown |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  ┆ unknown |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      ┆ …       |
| images/000000431980.jpg ┆ dog   ┆ 98.3   ┆ 110.46 ┆ 42.69  ┆ 26.64  ┆ unknown |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  ┆ unknown |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  ┆ unknown |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  ┆ unknown |
+-------------------------+-------+--------+--------+--------+--------+---------+

Add Row

You can also append new rows to the data frame. The --add-row option takes in a comma separated list of values and automatically parses the correct dtypes.

oxen df train.csv --add-row 'images/my_cat.jpg,cat,0,0,0,0'
shape: (9_001, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 ┆ 129.58 |
| images/000000544590.jpg ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 ┆ 188.35 |
| images/000000000581.jpg ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  ┆ 116.08 |
| images/000000236841.jpg ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  ┆ 42.29  |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000071025.jpg ┆ cat   ┆ 55.33  ┆ 105.45 ┆ 160.15 ┆ 73.57  |
| images/000000518015.jpg ┆ cat   ┆ 43.72  ┆ 4.34   ┆ 72.98  ┆ 129.1  |
| images/000000171435.jpg ┆ dog   ┆ 22.86  ┆ 100.03 ┆ 125.55 ┆ 41.61  |
| images/my_cat.jpg       ┆ cat   ┆ 0.0    ┆ 0.0    ┆ 0.0    ┆ 0.0    |
+-------------------------+-------+--------+--------+--------+--------+
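
Notice that the zeros typed on the command line show up as 0.0 in the f64 columns. One way to picture that dtype parsing is the Python sketch below, which casts each comma-separated value according to the column's dtype. This is an assumption about the general approach, not Oxen's code; parse_row and the CASTS table are hypothetical names.

```python
# Map the dtype names shown in the schema output to Python casts.
CASTS = {"str": str, "f64": float, "i64": int}


def parse_row(raw, schema):
    """Split a comma-separated row and cast each value to its column's dtype."""
    values = raw.split(",")
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} values, got {len(values)}")
    return {
        name: CASTS[dtype](value)
        for (name, dtype), value in zip(schema.items(), values)
    }


schema = {
    "file": "str",
    "label": "str",
    "min_x": "f64",
    "min_y": "f64",
    "width": "f64",
    "height": "f64",
}
row = parse_row("images/my_cat.jpg,cat,0,0,0,0", schema)
print(row)
```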

Randomize

Often, you’ll want to randomize data before splitting into train and test sets, or just to peek at different data values. This can be done with the --randomize flag.

oxen df train.csv --randomize
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000124002.jpg ┆ cat   ┆ 82.92  ┆ 8.31   ┆ 108.31 ┆ 158.48 |
| images/000000207597.jpg ┆ dog   ┆ 75.64  ┆ 3.65   ┆ 125.47 ┆ 218.19 |
| images/000000113810.jpg ┆ cat   ┆ 104.34 ┆ 44.65  ┆ 119.66 ┆ 159.42 |
| images/000000340160.jpg ┆ dog   ┆ 79.78  ┆ 89.31  ┆ 127.1  ┆ 103.66 |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000310573.jpg ┆ dog   ┆ 102.55 ┆ 91.48  ┆ 42.24  ┆ 52.18  |
| images/000000162801.jpg ┆ cat   ┆ 112.96 ┆ 75.05  ┆ 57.38  ┆ 98.19  |
| images/000000544117.jpg ┆ dog   ┆ 108.16 ┆ 124.28 ┆ 11.08  ┆ 64.58  |
| images/000000283210.jpg ┆ dog   ┆ 49.37  ┆ 40.01  ┆ 174.43 ┆ 182.0  |
+-------------------------+-------+--------+--------+--------+--------+
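
The train and test use case mentioned above can be sketched in a few lines of Python: shuffle a copy of the rows, then cut it at a chosen fraction. This is a generic illustration, not part of Oxen; the helper shuffle_and_split, the 20% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def shuffle_and_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten data frame rows
train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Shuffling before the cut matters when the file is ordered, for example sorted by label, because a straight head and tail split would then put different classes in each set.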

Sort

You can sort your data with the --sort flag, using the values of any column in your data frame.

oxen df train.csv --sort "height"
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    ┆ label ┆ min_x  ┆ min_y  ┆ width  ┆ height |
| ---                     ┆ ---   ┆ ---    ┆ ---    ┆ ---    ┆ ---    |
| str                     ┆ str   ┆ f64    ┆ f64    ┆ f64    ┆ f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000580919.jpg ┆ dog   ┆ 61.28  ┆ 88.31  ┆ 2.71   ┆ 1.83   |
| images/000000577310.jpg ┆ dog   ┆ 132.25 ┆ 193.86 ┆ 3.28   ┆ 1.95   |
| images/000000393384.jpg ┆ dog   ┆ 138.85 ┆ 89.89  ┆ 1.25   ┆ 2.11   |
| images/000000477398.jpg ┆ dog   ┆ 185.11 ┆ 195.93 ┆ 2.51   ┆ 2.6    |
| …                       ┆ …     ┆ …      ┆ …      ┆ …      ┆ …      |
| images/000000069205.jpg ┆ dog   ┆ 0.0    ┆ 0.0    ┆ 224.0  ┆ 224.0  |
| images/000000554737.jpg ┆ cat   ┆ 0.0    ┆ 0.0    ┆ 224.0  ┆ 224.0  |
| images/000000213819.jpg ┆ cat   ┆ 8.32   ┆ 0.0    ┆ 207.77 ┆ 224.0  |
| images/000000397212.jpg ┆ cat   ┆ 0.36   ┆ 0.0    ┆ 115.5  ┆ 224.0  |
+-------------------------+-------+--------+--------+--------+--------+

Reverse

You can also reverse the order of a data table. By default, --sort sorts in ascending order; combine it with the --reverse flag to get descending order instead.

oxen df train.csv --sort "height" --reverse
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    | label | min_x  | min_y  | width  | height |
| ---                     | ---   | ---    | ---    | ---    | ---    |
| str                     | str   | f64    | f64    | f64    | f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000397212.jpg | cat   | 0.36   | 0.0    | 115.5  | 224.0  |
| images/000000213819.jpg | cat   | 8.32   | 0.0    | 207.77 | 224.0  |
| images/000000554737.jpg | cat   | 0.0    | 0.0    | 224.0  | 224.0  |
| images/000000069205.jpg | dog   | 0.0    | 0.0    | 224.0  | 224.0  |
| images/000000242607.jpg | dog   | 0.6    | 0.0    | 185.31 | 224.0  |
| …                       | …     | …      | …      | …      | …      |
| images/000000371532.jpg | dog   | 34.43  | 100.07 | 6.47   | 2.71   |
| images/000000477398.jpg | dog   | 185.11 | 195.93 | 2.51   | 2.6    |
| images/000000393384.jpg | dog   | 138.85 | 89.89  | 1.25   | 2.11   |
| images/000000577310.jpg | dog   | 132.25 | 193.86 | 3.28   | 1.95   |
| images/000000580919.jpg | dog   | 61.28  | 88.31  | 2.71   | 1.83   |
+-------------------------+-------+--------+--------+--------+--------+
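
For comparison, the same sort-then-reverse idea looks like this in plain Python. It is only an analogy for the ordering described above, not Oxen's implementation, and the three sample rows reuse file names and heights from the output in this document.

```python
rows = [
    {"file": "images/000000397212.jpg", "height": 224.0},
    {"file": "images/000000580919.jpg", "height": 1.83},
    {"file": "images/000000393384.jpg", "height": 2.11},
]

# Ascending order by the named column, like sorting on "height".
ascending = sorted(rows, key=lambda row: row["height"])

# The same ordering flipped, like adding a reverse step after the sort.
descending = list(reversed(ascending))

print([row["height"] for row in ascending])
print([row["height"] for row in descending])
```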