Data Frames
Oxen provides a powerful data frame library that allows you to interact with tabular data.
Data frames are 2-dimensional tables that organize data into rows and columns, like Excel spreadsheets. Whether you're dealing with CSV, Parquet, or line-delimited JSON, they can help you view, edit, query, and filter your data.
Below are several examples of data frame usage with our public SpamOrHam repository, which compares spam (spam) text messages to real (ham) ones. You can use these commands to download the data and follow along.
mkdir spam-or-ham
cd spam-or-ham
oxen download datasets/SpamOrHam data.tsv
Look At Your Data
oxen df
Oxen uses the df command for all CLI actions involving data frames. For example, oxen df <FILENAME> displays the contents of tabular data files.
$ oxen df data.tsv
shape: (4_774, 2)
+----------+---------------------------------+
| category | text |
| --- | --- |
| str | str |
+----------+---------------------------------+
| ham | Go until jurong point, crazy..… |
| ham | Ok lar... Joking wif u oni... |
| spam | Free entry in 2 a wkly comp to… |
| ham | U dun say so early hor... U c … |
| ham | Nah I don't think he goes to u… |
| … | … |
| ham | Well, i'm glad you didn't find… |
| ham | Guy, no flash me now. If you g… |
| spam | Do you want a New Nokia 3510i … |
| ham | Mark works tomorrow. He gets o… |
| ham | Keep ur problems in ur heart, … |
+----------+---------------------------------+
Here, we see that SpamOrHam's dataset consists of 4,774 rows and 2 columns. The output is automatically truncated to 10 entries. To display the entire data set, you can use the --full flag.
You can also use oxen df options to view your data with modifications. These changes won't be written anywhere unless you use the --write or --output flags.
# Add extra column
$ oxen df data.tsv --add-col 'language:English:str'
shape: (4_774, 3)
+----------+---------------------------------+----------+
| category | text | language |
| --- | --- | --- |
| str | str | str |
+----------+---------------------------------+----------+
| ham | Go until jurong point, crazy..… | English |
| ham | Ok lar... Joking wif u oni... | English |
| spam | Free entry in 2 a wkly comp to… | English |
| ham | U dun say so early hor... U c … | English |
| ham | Nah I don't think he goes to u… | English |
| … | … | … |
| ham | Well, i'm glad you didn't find… | English |
| ham | Guy, no flash me now. If you g… | English |
| spam | Do you want a New Nokia 3510i … | English |
| ham | Mark works tomorrow. He gets o… | English |
| ham | Keep ur problems in ur heart, … | English |
+----------+---------------------------------+----------+
# Filter out spam messages, view text only
$ oxen df data.tsv --filter 'category == ham' --columns 'text'
shape: (4_124, 1)
+---------------------------------+
| text |
| --- |
| str |
+---------------------------------+
| Go until jurong point, crazy..… |
| Ok lar... Joking wif u oni... |
| U dun say so early hor... U c … |
| Nah I don't think he goes to u… |
| Even my brother is not like to… |
| … |
| I want to sent <#> mesa… |
| Well, i'm glad you didn't find… |
| Guy, no flash me now. If you g… |
| Mark works tomorrow. He gets o… |
| Keep ur problems in ur heart, … |
+---------------------------------+
# Randomize the data, then view the first 5 entries
$ oxen df data.tsv --head 5 --randomize
shape: (5, 2)
+----------+---------------------------------+
| category | text |
| --- | --- |
| str | str |
+----------+---------------------------------+
| ham | He didn't see his shadow. We g… |
| ham | Thank god they are in bed! |
| ham | Where are you ? You said you w… |
| spam | XCLUSIVE@CLUBSAISAI 2MOROW 28/… |
| ham | In which place do you want da. |
+----------+---------------------------------+
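As a rough mental model of what --filter and --columns do, the sketch below applies the same two steps to a few rows using only Python's standard library. It is an illustration, not Oxen's implementation: the helper names filter_rows and select_columns are hypothetical, and the three sample rows reuse messages that appear elsewhere on this page.

```python
import csv
import io

# A tiny tab-separated sample in the same shape as data.tsv (category, text).
SAMPLE_TSV = (
    "category\ttext\n"
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tCLICK HERE TO WIN INSTANTLY.\n"
    "ham\tThank god they are in bed!\n"
)


def filter_rows(rows, column, value):
    """Keep rows whose `column` equals `value`, like --filter 'column == value'."""
    return [row for row in rows if row[column] == value]


def select_columns(rows, columns):
    """Keep only the named columns, like --columns 'a,b'."""
    return [{name: row[name] for name in columns} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
ham_text = select_columns(filter_rows(rows, "category", "ham"), ["text"])
for row in ham_text:
    print(row["text"])
```

Neither helper touches the parsed rows in place, which mirrors how oxen df leaves the file alone unless --write or --output is given.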
Uploading Data
Before modifying your data, add it to a repository to preserve its history. This can be done in the UI, Python, or CLI.
If you've pushed to the Oxen Hub, you can view, edit, and query your data directly using the UI.
Editing Data Frames
Once you've added your data to an Oxen repository, you can interact with data frames even if they're not downloaded locally. Oxen exposes a CRUD interface that makes this possible.
from oxen import DataFrame
# Connect to the remote data frame (namespace/repository, then file path)
df = DataFrame("my-username/spam-or-ham", "data.tsv")
# Add a row; keys must match the data frame's columns (category, text)
row_id = df.insert_row({"category": "spam", "text": "CLICK HERE TO WIN INSTANTLY."})
# Get a row by id
row = df.get_row_by_id(row_id)
print(row)
# Update a row; only the listed columns change
row = df.update_row(row_id, {"category": "new_category"})
print(row)
# Delete a row
df.delete_row(row_id)
# Commit the changes
df.commit("Update label")
All of these operations are exposed over HTTP, so you are not limited to using the Python library. See the HTTP reference docs to learn how to interact with your data programmatically.
You can also edit data files locally with oxen df --write. Any modifications you make with this flag set will be written back to the original file and register as "modified" in your Oxen repository.
$ oxen df data.tsv --filter 'category == spam' --write
shape: (650, 2)
+----------+---------------------------------+
| category | text |
| --- | --- |
| str | str |
+----------+---------------------------------+
| spam | Free entry in 2 a wkly comp to… |
| spam | FreeMsg Hey there darling it's… |
| spam | WINNER!! As a valued network c… |
| spam | Had your mobile 11 months or m… |
| spam | SIX chances to win CASH! From … |
| … | … |
| spam | 83039 62735=£450 UK Break Acco… |
| spam | 5p 4 alfie Moon's Children in … |
| spam | WIN a £200 Shopping spree ever… |
| spam | This is the 2nd attempt to con… |
| spam | Do you want a New Nokia 3510i … |
+----------+---------------------------------+
Writing "data.tsv"
$ oxen df data.tsv
shape: (650, 2)
+----------+---------------------------------+
| category | text |
| --- | --- |
| str | str |
+----------+---------------------------------+
| spam | Free entry in 2 a wkly comp to… |
| spam | FreeMsg Hey there darling it's… |
| spam | WINNER!! As a valued network c… |
| spam | Had your mobile 11 months or m… |
| spam | SIX chances to win CASH! From … |
| … | … |
| spam | 83039 62735=£450 UK Break Acco… |
| spam | 5p 4 alfie Moon's Children in … |
| spam | WIN a £200 Shopping spree ever… |
| spam | This is the 2nd attempt to con… |
| spam | Do you want a New Nokia 3510i … |
+----------+---------------------------------+
Oxen uses a combination of Polars and DuckDB under the hood, and uses the Apache Arrow data format so that data can be shared across applications.
Useful Commands
There are many ways you might want to view, transform, and filter your data on the command line before committing changes to the dataset. oxen df provides several options that can help with this.
For these examples, we'll use our CatDogBBox repository.
Convert Dataset Format
Oxen allows you to quickly transform data files between data formats. When you run oxen df with --output, the resulting data frame will be written to disk as a new file of the specified type.
Binary formats like parquet and arrow are typically more compact and faster to read than text formats, but are not human readable like tsv or csv. These are tradeoffs you'll have to weigh for your application. Oxen currently supports the following file extensions: csv, tsv, parquet, arrow, json, jsonl.
oxen df train.csv -o train.parquet
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | cat | 0.0 | 19.27 | 130.79 | 129.58 |
| images/000000544590.jpg | cat | 9.75 | 13.49 | 214.25 | 188.35 |
| images/000000000581.jpg | dog | 49.37 | 67.79 | 74.29 | 116.08 |
| images/000000236841.jpg | cat | 115.21 | 96.65 | 93.87 | 42.29 |
| … | … | … | … | … | … |
| images/000000431980.jpg | dog | 98.3 | 110.46 | 42.69 | 26.64 |
| images/000000071025.jpg | cat | 55.33 | 105.45 | 160.15 | 73.57 |
| images/000000518015.jpg | cat | 43.72 | 4.34 | 72.98 | 129.1 |
| images/000000171435.jpg | dog | 22.86 | 100.03 | 125.55 | 41.61 |
+-------------------------+-------+--------+--------+--------+--------+
Writing "train.parquet"
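Converting between formats changes the serialization, not the rows. As a small stand-in (Python's standard library cannot write parquet), this sketch re-encodes two CSV rows from the table above as line-delimited JSON. The csv_to_jsonl helper is hypothetical and is not part of Oxen.

```python
import csv
import io
import json

# Two rows copied from the train.csv output shown above.
CSV_TEXT = (
    "file,label,min_x,min_y,width,height\n"
    "images/000000128154.jpg,cat,0.0,19.27,130.79,129.58\n"
    "images/000000000581.jpg,dog,49.37,67.79,74.29,116.08\n"
)

NUMERIC_COLUMNS = ("min_x", "min_y", "width", "height")


def csv_to_jsonl(csv_text):
    """Re-encode CSV text as JSON Lines, casting the numeric columns to float."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        lines.append(json.dumps(row))
    return "\n".join(lines)


jsonl_text = csv_to_jsonl(CSV_TEXT)
print(jsonl_text)
```

CSV carries no type information, so the sketch has to be told which columns are numeric; formats such as parquet store the dtypes alongside the data.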
SQL Query
Oxen has a powerful SQL query engine built into the CLI. You can run SQL queries on your data frames with the --sql flag. Within the query, the data frame is referred to as df.
oxen df train.csv --sql 'SELECT * FROM df WHERE label = "dog"'
shape: (4_860, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | dog | 0.0 | 19.27 | 130.79 | 129.58 |
| images/000000544590.jpg | dog | 9.75 | 13.49 | 214.25 | 188.35 |
| images/000000000581.jpg | dog | 49.37 | 67.79 | 74.29 | 116.08 |
| images/000000236841.jpg | dog | 115.21 | 96.65 | 93.87 | 42.29 |
| … | … | … | … | … | … |
| images/000000055645.jpg | dog | 8.67 | 122.36 | 60.22 | 99.24 |
| images/000000094271.jpg | dog | 47.6 | 115.26 | 111.57 | 102.27 |
| images/000000041257.jpg | dog | 6.81 | 117.29 | 207.06 | 86.08 |
| images/000000321014.jpg | dog | 51.86 | 61.18 | 166.26 | 63.11 |
+-------------------------+-------+--------+--------+--------+--------+
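The query is ordinary SQL against a table named df. The sketch below runs a similar query with Python's built-in sqlite3 module on three rows taken from the tables on this page. SQLite is only a stand-in here (Oxen itself uses Polars and DuckDB), so dialect details may differ.

```python
import sqlite3

# (file, label, width, height) for three rows of train.csv shown on this page.
ROWS = [
    ("images/000000128154.jpg", "cat", 130.79, 129.58),
    ("images/000000000581.jpg", "dog", 74.29, 116.08),
    ("images/000000171435.jpg", "dog", 125.55, 41.61),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (file TEXT, label TEXT, width REAL, height REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?)", ROWS)

# Bind the label as a parameter instead of splicing it into the SQL string.
dogs = conn.execute(
    "SELECT file, width FROM df WHERE label = ? ORDER BY width DESC", ("dog",)
).fetchall()
conn.close()

for file, width in dogs:
    print(file, width)
```

Anything SQL can express on one table (projections, ordering, aggregates) follows the same pattern, which is where a full query goes beyond a simple comparison filter.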
Filter
If you don't need a full SQL query, Oxen also has a lightweight --filter option, which supports the >, <, and == operators.
oxen df train.csv --filter 'width > 100'
shape: (3_483, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | cat | 0.0 | 19.27 | 130.79 | 129.58 |
| images/000000544590.jpg | cat | 9.75 | 13.49 | 214.25 | 188.35 |
| images/000000177913.jpg | dog | 11.56 | 52.83 | 177.18 | 166.41 |
| images/000000002337.jpg | dog | 5.42 | 7.28 | 180.01 | 167.21 |
| images/000000012673.jpg | cat | 117.11 | 98.61 | 106.61 | 47.17 |
| … | … | … | … | … | … |
| images/000000399102.jpg | cat | 106.03 | 145.22 | 111.91 | 66.68 |
| images/000000155707.jpg | cat | 14.18 | 13.97 | 165.84 | 207.62 |
| images/000000150919.jpg | cat | 38.7 | 71.29 | 147.7 | 127.17 |
| images/000000071025.jpg | cat | 55.33 | 105.45 | 160.15 | 73.57 |
| images/000000171435.jpg | dog | 22.86 | 100.03 | 125.55 | 41.61 |
+-------------------------+-------+--------+--------+--------+--------+
View Schema
Oxen automatically detects and versions the schema of your data frame. See the schema docs (/concepts/schemas) for more information.
To view a data frame's schema in full, you can use the --schema flag.
oxen df train.csv --schema
+--------+-------+
| column | dtype |
+--------+-------+
| file | str |
|--------+-------|
| label | str |
|--------+-------|
| min_x | f64 |
|--------+-------|
| min_y | f64 |
|--------+-------|
| width | f64 |
|--------+-------|
| height | f64 |
+--------+-------+
View Specific Columns
If you only need a subset of your data frame's columns, you can specify them in a comma-separated list with --columns.
oxen df train.csv --columns 'file,label'
shape: (9_000, 2)
+-------------------------+-------+
| file | label |
| --- | --- |
| str | str |
+-------------------------+-------+
| images/000000128154.jpg | cat |
| images/000000544590.jpg | cat |
| images/000000000581.jpg | dog |
| images/000000236841.jpg | cat |
| … | … |
| images/000000431980.jpg | dog |
| images/000000071025.jpg | cat |
| images/000000518015.jpg | cat |
| images/000000171435.jpg | dog |
+-------------------------+-------+
Take Indices
You can also view particular rows by passing a comma-separated list of zero-based row indices to --take.
oxen df train.csv --take '1,13,42'
shape: (3, 6)
+-------------------------+-------+-------+-------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+-------+-------+--------+--------+
| images/000000544590.jpg | cat | 9.75 | 13.49 | 214.25 | 188.35 |
| images/000000279829.jpg | cat | 30.01 | 13.58 | 82.51 | 176.39 |
| images/000000209289.jpg | dog | 72.75 | 42.06 | 111.52 | 153.09 |
+-------------------------+-------+-------+-------+--------+--------+
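Judging by the output above, the indices are zero-based: index 1 returns the second row of train.csv. In plain Python the same selection is a list lookup. The take helper here is hypothetical, and the file names are the first four rows of train.csv as listed on this page.

```python
def take(rows, indices):
    """Return the rows at the given zero-based positions, in the order requested."""
    return [rows[i] for i in indices]


# First four file names of train.csv, in file order.
files = [
    "images/000000128154.jpg",
    "images/000000544590.jpg",
    "images/000000000581.jpg",
    "images/000000236841.jpg",
]

picked = take(files, [1, 3])
print(picked)
```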
Unique
Oxen can efficiently compute all the unique values of a given column or set of columns using the --unique option.
oxen df train.csv --unique "file"
oxen df train.csv -u "file,label"
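Uniqueness over several columns means distinct combinations of values, not distinct values per column. Here is a minimal standard-library sketch of that idea. The unique_rows helper and the sample rows are made up for illustration, and keeping the first row seen for each combination is an assumption of this sketch rather than documented Oxen behavior.

```python
def unique_rows(rows, columns):
    """Return the first row seen for each distinct combination of `columns`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[name] for name in columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical rows: one file annotated three times, another once.
rows = [
    {"file": "a.jpg", "label": "cat"},
    {"file": "a.jpg", "label": "dog"},
    {"file": "a.jpg", "label": "cat"},
    {"file": "b.jpg", "label": "dog"},
]

by_file = unique_rows(rows, ["file"])
by_file_and_label = unique_rows(rows, ["file", "label"])
print(len(by_file), len(by_file_and_label))
```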
Concatenate (vstack)
If you've filtered down your data and want to stack it back into a single frame, use the --vstack option. It takes a variable-length list of files you'd like to concatenate.
oxen df train.csv --filter 'label == dog' -o /tmp/dogs.parquet
oxen df train.csv --filter 'label == cat' -o /tmp/cats.parquet
oxen df /tmp/cats.parquet --vstack /tmp/dogs.parquet -o annotations/data.parquet
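Vertical stacking appends the rows of one frame under another, which only makes sense when both frames have the same columns. A small illustration with plain Python lists of dicts follows. The vstack helper and its column check are assumptions for this sketch, not a description of how Oxen validates its inputs.

```python
def vstack(*frames):
    """Concatenate frames (lists of dict rows) top to bottom.

    Raises ValueError if any row's columns differ from the first row's.
    """
    stacked = []
    expected = None
    for frame in frames:
        for row in frame:
            columns = list(row)
            if expected is None:
                expected = columns
            elif columns != expected:
                raise ValueError(f"column mismatch: {columns} != {expected}")
            stacked.append(row)
    return stacked


cats = [{"file": "images/000000128154.jpg", "label": "cat"}]
dogs = [{"file": "images/000000000581.jpg", "label": "dog"}]

combined = vstack(cats, dogs)
print([row["label"] for row in combined])
```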
Add Column
Your data might not match the schema of a data frame you want to combine it with, in which case you may need to add a column to match it. You can do this and fill the new column with a default value using --add-col 'col:val:dtype'.
oxen df train.csv --add-col 'is_cute:unknown:str'
shape: (9_000, 7)
+-------------------------+-------+--------+--------+--------+--------+---------+
| file | label | min_x | min_y | width | height | is_cute |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 | str |
+-------------------------+-------+--------+--------+--------+--------+---------+
| images/000000128154.jpg | cat | 0.0 | 19.27 | 130.79 | 129.58 | unknown |
| images/000000544590.jpg | cat | 9.75 | 13.49 | 214.25 | 188.35 | unknown |
| images/000000000581.jpg | dog | 49.37 | 67.79 | 74.29 | 116.08 | unknown |
| images/000000236841.jpg | cat | 115.21 | 96.65 | 93.87 | 42.29 | unknown |
| … | … | … | … | … | … | … |
| images/000000431980.jpg | dog | 98.3 | 110.46 | 42.69 | 26.64 | unknown |
| images/000000071025.jpg | cat | 55.33 | 105.45 | 160.15 | 73.57 | unknown |
| images/000000518015.jpg | cat | 43.72 | 4.34 | 72.98 | 129.1 | unknown |
| images/000000171435.jpg | dog | 22.86 | 100.03 | 125.55 | 41.61 | unknown |
+-------------------------+-------+--------+--------+--------+--------+---------+
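The col:val:dtype argument packs three fields into one string. The sketch below shows one way such a spec could be parsed and applied to rows held as dicts. The add_column helper is hypothetical, and its dtype table covers only the two dtype names that appear in the tables on this page (str and f64).

```python
# Map dtype names to Python casts (an assumed subset, for illustration only).
CASTS = {"str": str, "f64": float}


def add_column(rows, spec):
    """Return copies of `rows` with a constant column described by 'name:value:dtype'."""
    name, value, dtype = spec.split(":")
    default = CASTS[dtype](value)
    return [{**row, name: default} for row in rows]


rows = [
    {"file": "images/000000128154.jpg", "label": "cat"},
    {"file": "images/000000000581.jpg", "label": "dog"},
]

with_cute = add_column(rows, "is_cute:unknown:str")
print(with_cute[0])
```

A default value that itself contains a colon would break this simple split, so a real parser would need a stricter rule.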
Add Row
You can also append new rows to the data frame. The --add-row option takes in a comma-separated list of values and automatically parses the correct dtypes.
oxen df train.csv --add-row 'images/my_cat.jpg,cat,0,0,0,0'
shape: (9_001, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | cat | 0.0 | 19.27 | 130.79 | 129.58 |
| images/000000544590.jpg | cat | 9.75 | 13.49 | 214.25 | 188.35 |
| images/000000000581.jpg | dog | 49.37 | 67.79 | 74.29 | 116.08 |
| images/000000236841.jpg | cat | 115.21 | 96.65 | 93.87 | 42.29 |
| … | … | … | … | … | … |
| images/000000071025.jpg | cat | 55.33 | 105.45 | 160.15 | 73.57 |
| images/000000518015.jpg | cat | 43.72 | 4.34 | 72.98 | 129.1 |
| images/000000171435.jpg | dog | 22.86 | 100.03 | 125.55 | 41.61 |
| images/my_cat.jpg | cat | 0.0 | 0.0 | 0.0 | 0.0 |
+-------------------------+-------+--------+--------+--------+--------+
Randomize
Often, you'll want to randomize data before splitting into train and test sets, or just to peek at different data values. This can be done with the --randomize flag.
oxen df train.csv --randomize
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000124002.jpg | cat | 82.92 | 8.31 | 108.31 | 158.48 |
| images/000000207597.jpg | dog | 75.64 | 3.65 | 125.47 | 218.19 |
| images/000000113810.jpg | cat | 104.34 | 44.65 | 119.66 | 159.42 |
| images/000000340160.jpg | dog | 79.78 | 89.31 | 127.1 | 103.66 |
| … | … | … | … | … | … |
| images/000000310573.jpg | dog | 102.55 | 91.48 | 42.24 | 52.18 |
| images/000000162801.jpg | cat | 112.96 | 75.05 | 57.38 | 98.19 |
| images/000000544117.jpg | dog | 108.16 | 124.28 | 11.08 | 64.58 |
| images/000000283210.jpg | dog | 49.37 | 40.01 | 174.43 | 182.0 |
+-------------------------+-------+--------+--------+--------+--------+
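Shuffling before a split keeps both sets representative when the file happens to be ordered, for example by label. Here is a minimal sketch with a fixed seed so the split is reproducible. The shuffle_and_split helper, the placeholder rows, and the 80/20 ratio are illustrative choices, not something Oxen prescribes.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for the lines of a data file.
rows = [f"row-{i}" for i in range(10)]
train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Using a dedicated random.Random instance keeps the shuffle independent of any other code that touches the global random state.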
Sort
You can sort your data by the values of any column in your data frame with the --sort flag.
oxen df train.csv --sort "height"
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000580919.jpg | dog | 61.28 | 88.31 | 2.71 | 1.83 |
| images/000000577310.jpg | dog | 132.25 | 193.86 | 3.28 | 1.95 |
| images/000000393384.jpg | dog | 138.85 | 89.89 | 1.25 | 2.11 |
| images/000000477398.jpg | dog | 185.11 | 195.93 | 2.51 | 2.6 |
| … | … | … | … | … | … |
| images/000000069205.jpg | dog | 0.0 | 0.0 | 224.0 | 224.0 |
| images/000000554737.jpg | cat | 0.0 | 0.0 | 224.0 | 224.0 |
| images/000000213819.jpg | cat | 8.32 | 0.0 | 207.77 | 224.0 |
| images/000000397212.jpg | cat | 0.36 | 0.0 | 115.5 | 224.0 |
+-------------------------+-------+--------+--------+--------+--------+
Reverse
You can also reverse the order of a data table. By default --sort sorts in ascending order, but this can be switched by adding the --reverse flag.
oxen df train.csv --sort "height" --reverse
shape: (9_000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000397212.jpg | cat | 0.36 | 0.0 | 115.5 | 224.0 |
| images/000000213819.jpg | cat | 8.32 | 0.0 | 207.77 | 224.0 |
| images/000000554737.jpg | cat | 0.0 | 0.0 | 224.0 | 224.0 |
| images/000000069205.jpg | dog | 0.0 | 0.0 | 224.0 | 224.0 |
| images/000000242607.jpg | dog | 0.6 | 0.0 | 185.31 | 224.0 |
| … | … | … | … | … | … |
| images/000000371532.jpg | dog | 34.43 | 100.07 | 6.47 | 2.71 |
| images/000000477398.jpg | dog | 185.11 | 195.93 | 2.51 | 2.6 |
| images/000000393384.jpg | dog | 138.85 | 89.89 | 1.25 | 2.11 |
| images/000000577310.jpg | dog | 132.25 | 193.86 | 3.28 | 1.95 |
| images/000000580919.jpg | dog | 61.28 | 88.31 | 2.71 | 1.83 |
+-------------------------+-------+--------+--------+--------+--------+
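Sorting by a column and then flipping the order gives a descending sort. The same idea in plain Python, using three file and height pairs from the tables above (an illustration only, not Oxen's implementation):

```python
rows = [
    {"file": "images/000000069205.jpg", "height": 224.0},
    {"file": "images/000000580919.jpg", "height": 1.83},
    {"file": "images/000000577310.jpg", "height": 1.95},
]

# Ascending by height, like sorting on the "height" column.
ascending = sorted(rows, key=lambda row: row["height"])

# Descending: the same sort with the order flipped.
descending = sorted(rows, key=lambda row: row["height"], reverse=True)

print([row["height"] for row in ascending])
print([row["height"] for row in descending])
```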