oxen.data_frame

DataFrame Objects

class DataFrame()

The DataFrame class allows you to perform CRUD operations on a remote data frame.

If you pass in a Workspace or a RemoteRepo, the data is indexed into DuckDB on an oxen-server without being downloaded locally.

Examples

CRUD Operations

Index a data frame in a workspace.

from oxen import DataFrame

# Connect to and index the data frame
# Note: This must be an existing file committed to the repo
#       indexing may take a while for large files
data_frame = DataFrame("datasets/SpamOrHam", "data.tsv")

# Add a row
row_id = data_frame.insert_row({"category": "spam", "message": "Hello, do I have an offer for you!"})

# Get a row by id
row = data_frame.get_row_by_id(row_id)
print(row)

# Update a row
row = data_frame.update_row(row_id, {"category": "ham"})
print(row)

# Delete a row
data_frame.delete_row(row_id)

# Get the current changes to the data frame
status = data_frame.diff()
print(status.added_files())

# Commit the changes
data_frame.commit("Updating data.tsv")

__init__

def __init__(remote: Union[str, RemoteRepo, Workspace],
             path: str,
             host: str = "hub.oxen.ai",
             branch: Optional[str] = None,
             scheme: str = "https",
             workspace_name: Optional[str] = None)

Initialize the DataFrame class. The data frame is indexed into DuckDB on initialization.

Raises an error if the data frame does not exist.

Arguments:

remote - str, RemoteRepo, or Workspace The workspace or remote repo the data frame is in.
path - str The path of the data frame file in the repository.
host - str The host of the oxen-server. Defaults to "hub.oxen.ai".
branch - Optional[str] The branch of the remote repo. Defaults to None.
scheme - str The scheme of the remote repo. Defaults to "https".
workspace_name - Optional[str] The name of the workspace. Defaults to None.

workspace_url

def workspace_url(host: str = "oxen.ai", scheme: str = "https") -> str

Get the URL of the data frame.

size

def size() -> tuple[int, int]

Get the size of the data frame. Returns a tuple of (rows, columns).

page_size

def page_size() -> int

Get the page size of the data frame, as used for pagination by list_page().

Returns:

The page size of the data frame.

total_pages

def total_pages() -> int

Get the total number of pages in the data frame, as used for pagination by list_page().

Returns:

The total number of pages in the data frame.

list_page

def list_page(page_num: int = 1) -> List[dict]

List the rows within the data frame.

Arguments:

page_num - int The page number of the data frame to list. The page size currently defaults to 100.

Returns:

A list of rows from the data frame.
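The pagination methods above can be combined to walk an entire data frame. The following is a minimal sketch, assuming pages are numbered from 1 through total_pages() inclusive (consistent with the page_num default of 1). iter_rows is a hypothetical helper and not part of oxen; it accepts any object exposing total_pages() and list_page().

```python
from typing import Any, Dict, Iterator


def iter_rows(data_frame: Any) -> Iterator[Dict[str, Any]]:
    """Yield every row of a data frame, one page at a time.

    Hypothetical helper: `data_frame` is any object exposing the
    total_pages() and list_page(page_num) methods documented above.
    Assumes pages are numbered 1..total_pages() inclusive.
    """
    for page_num in range(1, data_frame.total_pages() + 1):
        # list_page returns a list of row dictionaries for that page
        yield from data_frame.list_page(page_num)
```

Because the helper is a generator, only one page of rows is held in memory at a time, which may matter for large remote data frames.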

insert_row

def insert_row(data: dict)

Insert a single row of data into the data frame.

Arguments:

data - dict A dictionary representing a single row of data. The keys must match a subset of the columns in the data frame. If a column is not present in the dictionary, it will be set to an empty value.

Returns:

The id of the row that was inserted.
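Since the keys of data must be a subset of the data frame's columns, it can help to check a row before sending it to the server. The sketch below is one way to do that in plain Python. missing_columns is a hypothetical helper, and the list of column names is supplied by the caller; how those names are obtained is outside the scope of this sketch.

```python
from typing import Any, Dict, Iterable, List


def missing_columns(data: Dict[str, Any], columns: Iterable[str]) -> List[str]:
    """Check a row dictionary against known column names before inserting it.

    Hypothetical helper, not part of oxen. Raises ValueError if `data`
    contains keys that are not columns; otherwise returns the columns the
    row omits (per the description above, insert_row leaves these empty).
    """
    known = list(columns)
    unknown = [key for key in data if key not in known]
    if unknown:
        raise ValueError(f"Unknown columns: {unknown}")
    return [name for name in known if name not in data]
```

For example, calling missing_columns({"category": "spam"}, ["category", "message"]) reports that "message" would be left empty.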

where_sql_from_dict

def where_sql_from_dict(attributes: dict, operator: str = "AND") -> str

Generate the SQL WHERE clause from the attributes, combining the conditions with the given operator.
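To illustrate the general idea of turning a dictionary of attributes into SQL conditions, here is a small standalone sketch. It is not the library's implementation: the function name where_clause, the quoting rules, and the exact output format are all assumptions, and the string returned by where_sql_from_dict may look different.

```python
from typing import Any, Dict


def where_clause(attributes: Dict[str, Any], operator: str = "AND") -> str:
    """Illustrative only: join `column = value` conditions with `operator`.

    Assumptions (not taken from oxen): strings are single-quoted with
    embedded quotes doubled, None becomes IS NULL, other values use str().
    Column names are NOT escaped, so they must come from a trusted source.
    """
    def condition(column: str, value: Any) -> str:
        if value is None:
            return f"{column} IS NULL"
        if isinstance(value, str):
            escaped = value.replace("'", "''")
            return f"{column} = '{escaped}'"
        return f"{column} = {value}"

    conditions = [condition(column, value) for column, value in attributes.items()]
    return f" {operator} ".join(conditions)
```

With these assumptions, where_clause({"category": "spam", "id": 3}) produces category = 'spam' AND id = 3.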

select_sql_from_dict

def select_sql_from_dict(attributes: dict,
                         columns: Optional[List[str]] = None) -> str

Generate the SQL SELECT statement from the attributes, optionally restricted to the given columns.

get_embeddings

def get_embeddings(attributes: dict, column: str = "embedding") -> List[float]

Get the embedding from the data frame.

is_nearest_neighbors_enabled

def is_nearest_neighbors_enabled(column="embedding")

Check if the embeddings column is indexed in the data frame.

enable_nearest_neighbors

def enable_nearest_neighbors(column: str = "embedding")

Index the embeddings in the data frame.
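The two methods above pair naturally: check whether a column is already indexed, and index it only if it is not. A minimal sketch follows, assuming is_nearest_neighbors_enabled returns a truthy value when the column is indexed. ensure_nearest_neighbors is a hypothetical helper and not part of oxen.

```python
from typing import Any


def ensure_nearest_neighbors(data_frame: Any, column: str = "embedding") -> bool:
    """Index `column` for nearest neighbors search only if needed.

    Hypothetical helper: `data_frame` is any object exposing the
    is_nearest_neighbors_enabled() and enable_nearest_neighbors() methods
    documented above. Returns True if indexing was triggered, False if the
    column was already indexed.
    """
    if data_frame.is_nearest_neighbors_enabled(column):
        return False
    data_frame.enable_nearest_neighbors(column)
    return True
```

Calling the helper twice in a row should trigger indexing at most once, which avoids re-indexing a column unnecessarily.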

query

def query(sql: Optional[str] = None,
          find_embedding_where: Optional[dict] = None,
          embedding: Optional[list[float]] = None,
          sort_by_similarity_to: Optional[str] = None,
          page_num: int = 1,
          page_size: int = 10)

Sort the data frame by the embedding.

nearest_neighbors_search

def nearest_neighbors_search(find_embedding_where: dict,
                             sort_by_similarity_to: str = "embedding")

Get the nearest neighbors to the embedding.

get_by

def get_by(attributes: dict)

Get a single row of data by attributes.

get_row

def get_row(idx: int)

Get a single row of data by index.

Arguments:

idx - int The index of the row to get.

Returns:

A dictionary representing the row.

get_row_by_id

def get_row_by_id(id: str)

Get a single row of data by id.

Arguments:

id - str The id of the row to get.

Returns:

A dictionary representing the row.

update_row

def update_row(id: str, data: dict)

Update a single row of data by id.

Arguments:

id - str The id of the row to update.
data - dict A dictionary representing a single row of data. The keys must match a subset of the columns in the data frame. If a column is not present in the dictionary, it will be set to an empty value.

Returns:

The updated row as a dictionary.

delete_row

def delete_row(id: str)

Delete a single row of data by id.

Arguments:

id - str The id of the row to delete.

restore

def restore()

Unstage any changes to the schema or contents of a data frame.

commit

def commit(message: str, branch: Optional[str] = None)

Commit the current changes to the data frame.

Arguments:

message - str The commit message describing the changes.
branch - Optional[str] The branch to commit the changes to. Defaults to the current branch.
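restore and commit can be combined so that a batch of edits is either committed as a whole or discarded. The sketch below shows one possible pattern. insert_and_commit is a hypothetical helper and not part of oxen. Note that restore unstages all pending changes to the data frame, not only those made by this helper, so the pattern assumes nothing else is staged.

```python
from typing import Any, Dict, List


def insert_and_commit(data_frame: Any,
                      rows: List[Dict[str, Any]],
                      message: str) -> List[Any]:
    """Insert several rows, then commit; discard staged changes on failure.

    Hypothetical helper: `data_frame` is any object exposing the
    insert_row(), restore(), and commit() methods documented above.
    Returns the row ids in insertion order.
    """
    try:
        row_ids = [data_frame.insert_row(row) for row in rows]
    except Exception:
        # restore() unstages ALL pending changes, not just these inserts
        data_frame.restore()
        raise
    data_frame.commit(message)
    return row_ids
```

If any insert raises, the original exception is re-raised after the staged changes are discarded, so the caller still sees what went wrong.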
