oxen.datasets

load_dataset

def load_dataset(repo_id: str,
                 path: str,
                 fmt: str = "hugging_face",
                 revision=None)

Load a dataset from an Oxen repository into memory using the HuggingFace datasets library.

Arguments:

  • repo_id - str The namespace/repo_name of the Oxen repository to load the dataset from
  • path - str | Sequence[str] The path, or paths, of the data files to load
  • fmt - str The format of the data files. Currently only "hugging_face" is supported.
  • revision - str | None The commit id or branch name of the version of the data to download

Example:

from oxen.datasets import load_dataset
dataset = load_dataset("datasets/gsm8k", "train.jsonl")
# use datasets functions as you normally would
dataset.shuffle()[:10]
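
The revision argument can be used to pin a load to a specific branch or commit. The following is a minimal sketch, not part of the library: load_pinned is a hypothetical helper, and "main" is assumed to be a branch that exists in the repository.

```python
from typing import Optional


def load_pinned(repo_id: str, path: str, revision: Optional[str] = None):
    """Load `path` from `repo_id` at a fixed revision (hypothetical helper)."""
    # Imported lazily so this module can be imported without oxen installed.
    from oxen.datasets import load_dataset

    # revision=None leaves the choice of version to load_dataset.
    return load_dataset(repo_id, path, revision=revision)


# Usage (requires oxen and access to the repository):
# train = load_pinned("datasets/gsm8k", "train.jsonl", revision="main")
```

Passing a commit id instead of a branch name should keep the loaded data fixed even if the branch later moves.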

download

def download(repo_id: str,
             path: str,
             revision=None,
             dst=None,
             host="hub.oxen.ai",
             scheme="https")

Download files or directories from a remote Oxen repository.

Arguments:

  • repo_id - str The namespace/repo_name of the Oxen repository to download the data from
  • path - str The path to the data files
  • revision - str | None The commit id or branch name of the version of the data to download
  • dst - str | None The path to download the data to.
  • host - str The host to download the data from. (default: "hub.oxen.ai")
  • scheme - str The scheme to download the data with. (default: "https")
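
Example:

A minimal sketch of calling download with the arguments listed above. The wrapper download_snapshot is a hypothetical helper, and the repository, file, branch name, and destination are illustrative assumptions.

```python
from typing import Optional


def download_snapshot(repo_id: str,
                      path: str,
                      revision: Optional[str] = None,
                      dst: Optional[str] = None) -> None:
    """Download `path` from `repo_id` at `revision` into `dst` (hypothetical helper)."""
    # Imported lazily so this module can be imported without oxen installed.
    from oxen.datasets import download

    # host and scheme are left at their defaults ("hub.oxen.ai", "https").
    download(repo_id, path, revision=revision, dst=dst)


# Usage (requires oxen and network access):
# download_snapshot("datasets/gsm8k", "train.jsonl", revision="main", dst="data")
```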

upload

def upload(repo_id: str,
           path: str,
           message: str,
           branch: Optional[str] = None,
           dst: str = "")

Upload files or directories to a remote Oxen repository.

Arguments:

  • repo_id - str The namespace/repo_name of the Oxen repository to upload the data to
  • path - str The path to the data files
  • message - str The commit message to use when uploading the data
  • branch - str | None The branch to upload the data to. If None, the main branch is used.
  • dst - str The directory to upload the data to. (default: "")
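
Example:

A minimal sketch of calling upload with the arguments listed above. The wrapper upload_with_message is a hypothetical helper, its empty-message check is an illustrative guard rather than library behavior, and the repository and paths in the usage comment are placeholders. It assumes the environment is already set up with write access to the repository.

```python
from typing import Optional


def upload_with_message(repo_id: str,
                        path: str,
                        message: str,
                        branch: Optional[str] = None,
                        dst: str = "") -> None:
    """Upload `path` to `repo_id` with a commit message (hypothetical helper)."""
    # Illustrative guard, not part of oxen: refuse blank commit messages.
    if not message.strip():
        raise ValueError("commit message must not be empty")

    # Imported lazily so this module can be imported without oxen installed.
    from oxen.datasets import upload

    # branch=None lets upload fall back to the main branch, as documented.
    upload(repo_id, path, message, branch=branch, dst=dst)


# Usage (requires oxen, network access, and write permission):
# upload_with_message("my_namespace/my_repo", "train.jsonl",
#                     "Add training split", dst="data")
```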