Data Frames
Oxen provides a powerful data frame library that allows you to interact with tabular data.
Data frames are 2-dimensional tables that organize data into rows and columns, like Excel spreadsheets. Whether you’re dealing with csv, parquet, or line-delimited json, you can use them to view, edit, query, and filter your data.
Below are several examples of data frame usage with our public SpamOrHam repository, which contains text messages labeled as either spam or real (ham). You can use these commands to download the data and follow along.
Look At Your Data
oxen df
Oxen uses the df command for all CLI actions involving data frames. For example, oxen df <FILENAME> displays the contents of tabular data files.
Here, we see that SpamOrHam’s dataset consists of 4,774 rows and 2 columns. The output is automatically truncated to 10 entries. To display the entire dataset, you can use the --full flag.
You can also use oxen df options to view your data with modifications. These changes won’t be written anywhere unless you use the --write or --output flags.
Uploading Data
Before modifying your data, add it to a repository to preserve its history. This can be done in the UI, Python, or CLI.
If you’ve pushed to the Oxen Hub, you can view, edit, and query your data directly using the UI.
Editing Data Frames
Once you’ve added your data to an Oxen repository, you can interact with data frames even if they’re not downloaded locally. Oxen exposes a CRUD interface that makes this possible.
All of these operations are exposed over HTTP, so you are not limited to using the Python library. See our HTTP reference docs to learn how to interact with your data programmatically.
You can also edit data files locally with oxen df --write. Any modifications you make with this flag set will be written back to the original file and register as ‘modified’ in your Oxen repository.
Oxen uses a combination of Polars and DuckDB under the hood, and uses the Apache Arrow data format to provide cross-application functionality.
Useful Commands
There are many ways you might want to view, transform, and filter your data on the command line before committing changes to the dataset. oxen df provides several options that can help with this.
For these examples, we’ll use our CatDogBBox repository.
Convert Dataset Format
Oxen allows you to quickly transform data files between data formats. When you run oxen df with --output, the resulting data frame will be written to disk as a new file of the specified type.
Some formats, such as parquet and arrow, are more efficient for certain tasks but are not human readable like tsv or csv. These are tradeoffs you’ll have to weigh for your application. Oxen currently supports the following file extensions: csv, tsv, parquet, arrow, json, jsonl.
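To make the idea of a format conversion concrete, here is a minimal plain-Python sketch that rewrites csv text as line-delimited json. It uses only the standard library and is not how Oxen implements --output. The function name csv_to_jsonl, the column names, and the sample rows are made up for illustration.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text (with a header row) to line-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per row; values stay strings because CSV is untyped.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data, not taken from any Oxen repository.
sample = "text,category\nhello there,ham\nwin a prize now,spam\n"
jsonl = csv_to_jsonl(sample)
print(jsonl)
```

Note that the csv side carries no type information, which is one reason typed formats such as parquet or arrow can be a better fit for downstream processing.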
SQL Query
Oxen has a powerful SQL query engine built into the CLI. You can run SQL queries on your data frames with the --sql flag.
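As a rough illustration of what querying tabular data with SQL looks like, the sketch below loads a few rows into an in-memory SQLite table and runs an aggregate query. SQLite stands in for Oxen's own engine purely so the example is self-contained. The table name df, the column names, and the rows are all assumptions.

```python
import sqlite3

# Hypothetical rows; the real dataset's columns may differ.
rows = [
    ("hello there", "ham"),
    ("win a prize now", "spam"),
    ("see you at 5", "ham"),
]

conn = sqlite3.connect(":memory:")
# "df" is an assumed table name used only for this sketch.
conn.execute("CREATE TABLE df (text TEXT, category TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Count messages per category, most frequent first.
counts = conn.execute(
    "SELECT category, COUNT(*) AS n FROM df GROUP BY category ORDER BY n DESC"
).fetchall()
conn.close()
print(counts)
```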
View Specific Columns
If you only need a subset of your data frame’s columns, you can specify them in a comma-separated list with --columns.
Take Indices
You can also view particular rows using --take.
Unique
Oxen can efficiently compute all the unique values of a given column or set of columns using the --unique option.
Concatenate (vstack)
If you’ve filtered down your data and want to stack it back into a single frame, you can use --vstack. This option takes a variable-length list of files you’d like to concatenate.
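Conceptually, a vertical stack appends the rows of each table in order and only makes sense when the tables share the same columns. The plain-Python sketch below shows that idea on lists of dictionaries. It is not Oxen's implementation, and the vstack helper, the column names, and the sample rows are hypothetical.

```python
def vstack(*tables):
    """Stack tables (lists of dict rows) vertically, requiring matching columns."""
    stacked = []
    columns = None
    for table in tables:
        for row in table:
            if columns is None:
                columns = list(row)  # the first row seen defines the schema
            elif list(row) != columns:
                raise ValueError(
                    f"schema mismatch: expected {columns}, got {list(row)}"
                )
            stacked.append(dict(row))
    return stacked


# Hypothetical filtered subsets of a larger table.
cats = [
    {"file": "images/000.jpg", "label": "cat"},
    {"file": "images/001.jpg", "label": "cat"},
]
dogs = [{"file": "images/002.jpg", "label": "dog"}]

combined = vstack(cats, dogs)
print(len(combined))
```

Raising on a schema mismatch is a design choice of this sketch. It mirrors the reason you may need to add a column before combining two frames.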
Add Column
Your data might not match the schema of a data frame you want to combine it with, in which case you may need to add a column to match it. You can do this and project default values with --add-col 'col:val:dtype'.
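The 'col:val:dtype' string packs three things into one argument: the new column's name, its default value, and its type. Here is a hedged plain-Python sketch of how such a spec could be parsed and applied to every row. The dtype names in CASTS (str, i64, f64) and the helper names are assumptions for illustration, not a list of what Oxen accepts.

```python
# Assumed dtype names mapped to Python casts; Oxen's accepted names may differ.
CASTS = {"str": str, "i64": int, "f64": float}


def parse_add_col(spec):
    """Split 'col:val:dtype' into (column name, typed default value)."""
    name, rest = spec.split(":", 1)
    # Split from the right so the default value may itself contain colons.
    value, dtype = rest.rsplit(":", 1)
    return name, CASTS[dtype](value)


def add_col(rows, spec):
    """Return new rows with the column added and filled with the default."""
    name, default = parse_add_col(spec)
    return [{**row, name: default} for row in rows]


# Hypothetical rows and column name.
rows = [{"file": "images/000.jpg"}, {"file": "images/001.jpg"}]
with_flag = add_col(rows, "is_labeled:0:i64")
print(with_flag)
```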
Add Row
You can also append new rows to the data frame. The --add-row option takes in a comma-separated list of values and automatically parses the correct dtypes.
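To give a feel for what parsing dtypes from a comma-separated list involves, the sketch below tries each value as an integer, then as a float, and otherwise keeps it as a string. This is only one plausible inference rule written in plain Python. Oxen's actual parsing rules are not shown here, and the helper names, columns, and values are hypothetical.

```python
def infer(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def add_row(rows, columns, line):
    """Append one row parsed from a comma-separated string of values."""
    # Naive split: this sketch does not handle quoted values containing commas.
    values = [infer(v.strip()) for v in line.split(",")]
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} values, got {len(values)}")
    return rows + [dict(zip(columns, values))]


# Hypothetical schema and row.
columns = ["file", "label", "min_x", "count"]
table = add_row([], columns, "images/new.jpg, dog, 13.5, 7")
print(table)
```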
Randomize
Often, you’ll want to randomize data before splitting into train and test sets, or just to peek at different data values. This can be done with the --randomize flag.
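The reason to shuffle before splitting is that files are often ordered by label or by collection date, so a split taken from the original order can be badly skewed. Below is a minimal plain-Python sketch of a shuffle followed by a train/test split. The shuffle_split helper, the 80/20 fraction, and the fixed seed are assumptions for illustration and are unrelated to how --randomize works internally.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test)."""
    shuffled = list(rows)  # copy so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row ids; in practice these would be rows of your data frame.
rows = list(range(10))
train, test = shuffle_split(rows)
print(len(train), len(test))
```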
Sort
You can sort your data with the --sort flag, using the values of any column in your data frame.
Reverse
You can also reverse the order of a data table. By default, --sort sorts in ascending order, but this can be switched with the --reverse flag.
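The interaction between sorting and reversing can be sketched in a few lines of plain Python: sort ascending by one column, then optionally flip the whole table. This only illustrates the described behavior and is not Oxen's implementation. The sort_rows helper, the column names, and the values are hypothetical.

```python
def sort_rows(rows, column, reverse=False):
    """Sort rows ascending by one column, then optionally reverse the table."""
    ordered = sorted(rows, key=lambda row: row[column])
    if reverse:
        ordered.reverse()  # flip the ascending result to get descending order
    return ordered


# Hypothetical bounding-box rows.
boxes = [
    {"file": "images/000.jpg", "min_x": 40.5},
    {"file": "images/001.jpg", "min_x": 3.0},
    {"file": "images/002.jpg", "min_x": 17.25},
]

ascending = sort_rows(boxes, "min_x")
descending = sort_rows(boxes, "min_x", reverse=True)
print([row["min_x"] for row in ascending])
```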