oxen.data_frame
DataFrame Objects
The DataFrame class allows you to perform CRUD operations on a remote data frame.
If you pass in a Workspace or a RemoteRepo, the data is indexed into DuckDB on an oxen-server without downloading it locally.
Examples
CRUD Operations
Index a data frame in a workspace.
from oxen import DataFrame
# Connect to and index the data frame
# Note: This must be an existing file committed to the repo
# indexing may take a while for large files
data_frame = DataFrame("datasets/SpamOrHam", "data.tsv")
# Add a row
row_id = data_frame.insert_row({"category": "spam", "message": "Hello, do I have an offer for you!"})
# Get a row by id
row = data_frame.get_row_by_id(row_id)
print(row)
# Update a row
row = data_frame.update_row(row_id, {"category": "ham"})
print(row)
# Delete a row
data_frame.delete_row(row_id)
# Get the current changes to the data frame
status = data_frame.diff()
print(status.added_files())
# Commit the changes
data_frame.commit("Updating data.tsv")
__init__
def __init__(remote: Union[str, RemoteRepo, Workspace],
path: str,
host: str = "hub.oxen.ai",
branch: Optional[str] = None,
scheme: str = "https",
workspace_name: Optional[str] = None)
Initialize the DataFrame class. The data frame is indexed into DuckDB on init.
Raises an error if the data frame does not exist.
Arguments:
remote - str, RemoteRepo, or Workspace
The workspace or remote repo the data frame is in.
path - str
The path of the data frame file in the repository.
host - str
The host of the oxen-server. Defaults to “hub.oxen.ai”.
branch - Optional[str]
The branch of the remote repo. Defaults to None.
scheme - str
The scheme of the remote repo. Defaults to “https”.
workspace_url
def workspace_url(host: str = "oxen.ai", scheme: str = "https") -> str
Get the url of the data frame.
size
def size() -> tuple[int, int]
Get the size of the data frame. Returns a tuple of (rows, columns).
page_size
Get the page size of the data frame, used for pagination with list_page().
Returns:
The page size of the data frame.
total_pages
Get the total number of pages in the data frame, used for pagination with list_page().
Returns:
The total number of pages in the data frame.
list_page
def list_page(page_num: int = 1) -> List[dict]
List the rows within the data frame.
Arguments:
page_num - int
The page number of the data frame to list. The page size currently defaults to 100.
Returns:
A list of rows from the data frame.
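To read every row, the pagination described above can be driven from a loop. The following is a minimal sketch, not part of the library: `iter_rows` and the in-memory `FakeFrame` stand-in are hypothetical names, and the sketch assumes that `size()` returns `(rows, columns)` and that each page holds 100 rows.

```python
import math
from typing import Iterator, List


def iter_rows(data_frame, page_size: int = 100) -> Iterator[dict]:
    """Yield every row by walking list_page() from page 1 to the last page."""
    rows, _columns = data_frame.size()
    total_pages = math.ceil(rows / page_size)
    for page_num in range(1, total_pages + 1):
        for row in data_frame.list_page(page_num):
            yield row


class FakeFrame:
    """Hypothetical in-memory stand-in so the sketch runs without a server."""

    def __init__(self, rows: List[dict], page_size: int = 100):
        self._rows = rows
        self._page_size = page_size

    def size(self):
        # (rows, columns), mirroring the documented return shape
        return (len(self._rows), 2)

    def list_page(self, page_num: int = 1) -> List[dict]:
        start = (page_num - 1) * self._page_size
        return self._rows[start:start + self._page_size]


frame = FakeFrame([{"category": "ham", "message": f"msg {i}"} for i in range(250)])
all_rows = list(iter_rows(frame))
print(len(all_rows))  # 250
```

With a real DataFrame, the same loop would issue one request per page, so for large files it may be preferable to process rows as they are yielded rather than collecting them all in a list.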
insert_row
def insert_row(data: dict)
Insert a single row of data into the data frame.
Arguments:
data - dict
A dictionary representing a single row of data.
The keys must match a subset of the columns in the data frame.
If a column is not present in the dictionary,
it will be set to an empty value.
Returns:
The id of the row that was inserted.
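Because the keys must be a subset of the data frame's columns, it can help to check a row locally before sending it. This is a small sketch of such a check; `unknown_keys` is a hypothetical helper, not a library function, and the column names are taken from the SpamOrHam example above.

```python
from typing import Iterable, List


def unknown_keys(data: dict, columns: Iterable[str]) -> List[str]:
    """Return the keys in `data` that are not columns of the data frame."""
    known = set(columns)
    return [key for key in data if key not in known]


columns = ["category", "message"]

# A subset of the columns is fine; "message" would be left empty on insert.
partial_row = {"category": "spam"}
print(unknown_keys(partial_row, columns))  # []

# "label" is not a column, so this row should be fixed before inserting.
bad_row = {"category": "spam", "label": 1}
print(unknown_keys(bad_row, columns))  # ['label']
```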
where_sql_from_dict
def where_sql_from_dict(attributes: dict, operator: str = "AND") -> str
Generate the SQL WHERE clause from the attributes.
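The exact SQL this method produces is not shown here. As a rough illustration of the idea only, the sketch below builds a condition string from a dictionary by joining `column = value` pairs with the operator. `where_clause` is a hypothetical stand-in, not the library's implementation, and the quoting rules are an assumption.

```python
def where_clause(attributes: dict, operator: str = "AND") -> str:
    """Join `column = value` conditions with the given operator (illustrative only)."""
    parts = []
    for column, value in attributes.items():
        if isinstance(value, str):
            # Assumption: strings are single-quoted, with embedded quotes doubled.
            escaped = value.replace("'", "''")
            parts.append(f"{column} = '{escaped}'")
        else:
            parts.append(f"{column} = {value}")
    return f" {operator} ".join(parts)


print(where_clause({"category": "spam", "id": 7}))
# category = 'spam' AND id = 7
print(where_clause({"category": "spam", "id": 7}, operator="OR"))
# category = 'spam' OR id = 7
```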
select_sql_from_dict
def select_sql_from_dict(attributes: dict,
columns: Optional[List[str]] = None) -> str
Generate the SQL SELECT statement from the attributes.
get_embeddings
def get_embeddings(attributes: dict, column: str = "embedding") -> List[float]
Get the embedding from the data frame.
is_nearest_neighbors_enabled
def is_nearest_neighbors_enabled(column="embedding")
Check if the embeddings column is indexed in the data frame.
enable_nearest_neighbors
def enable_nearest_neighbors(column: str = "embedding")
Index the embeddings in the data frame.
query
def query(sql: Optional[str] = None,
find_embedding_where: Optional[dict] = None,
embedding: Optional[list[float]] = None,
sort_by_similarity_to: Optional[str] = None,
page_num: int = 1,
page_size: int = 10)
Sort the data frame by the embedding.
nearest_neighbors_search
def nearest_neighbors_search(find_embedding_where: dict,
sort_by_similarity_to: str = "embedding")
Get the nearest neighbors to the embedding.
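Conceptually, a nearest-neighbors search ranks rows by how similar their embedding is to a reference embedding. The sketch below shows that idea in plain Python using cosine similarity. The similarity metric used by the server is not stated here, so cosine similarity is an assumption, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers rather than library functions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(rows: List[dict], query: List[float],
                       column: str = "embedding") -> List[dict]:
    """Return rows sorted from most to least similar to `query`."""
    return sorted(rows, key=lambda row: cosine_similarity(row[column], query),
                  reverse=True)


rows = [
    {"message": "a", "embedding": [1.0, 0.0]},
    {"message": "b", "embedding": [0.0, 1.0]},
    {"message": "c", "embedding": [0.8, 0.6]},
]
ranked = rank_by_similarity(rows, [1.0, 0.0])
print([row["message"] for row in ranked])  # ['a', 'c', 'b']
```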
get_by
def get_by(attributes: dict)
Get a single row of data by attributes.
get_row
Get a single row of data by index.
Arguments:
idx - int
The index of the row to get.
Returns:
A dictionary representing the row.
get_row_by_id
def get_row_by_id(id: str)
Get a single row of data by id.
Arguments:
id - str
The id of the row to get.
Returns:
A dictionary representing the row.
update_row
def update_row(id: str, data: dict)
Update a single row of data by id.
Arguments:
id - str
The id of the row to update.
data - dict
A dictionary representing a single row of data.
The keys must match a subset of the columns in the data frame.
If a column is not present in the dictionary,
it will be set to an empty value.
Returns:
The updated row as a dictionary.
delete_row
Delete a single row of data by id.
Arguments:
id - str
The id of the row to delete.
restore
Unstage any changes to the schema or contents of a data frame.
commit
def commit(message: str, branch: Optional[str] = None)
Commit the current changes to the data frame.
Arguments:
message - str
The message to commit the changes.
branch - Optional[str]
The branch to commit the changes to. Defaults to the current branch.
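Several row edits can be grouped and committed once. The following sketch illustrates that pattern using only the documented `update_row(id, data)` and `commit(message, branch)` signatures. `relabel_and_commit` and the `RecordingFrame` stand-in are hypothetical names, and the `category` column comes from the SpamOrHam example above.

```python
from typing import Dict, Optional


def relabel_and_commit(data_frame, new_labels: Dict[str, str], message: str,
                       branch: Optional[str] = None) -> int:
    """Update the `category` of each row id in `new_labels`, then commit once."""
    for row_id, category in new_labels.items():
        data_frame.update_row(row_id, {"category": category})
    data_frame.commit(message, branch=branch)
    return len(new_labels)


class RecordingFrame:
    """Hypothetical stand-in that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def update_row(self, id: str, data: dict):
        self.calls.append(("update_row", id, data))

    def commit(self, message: str, branch: Optional[str] = None):
        self.calls.append(("commit", message, branch))


frame = RecordingFrame()
count = relabel_and_commit(frame, {"row-1": "ham", "row-2": "spam"}, "Relabel two rows")
print(count)            # 2
print(frame.calls[-1])  # ('commit', 'Relabel two rows', None)
```

Passing a real DataFrame in place of `RecordingFrame` would apply the same edits remotely; leaving `branch` as `None` commits to the current branch, as described above.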