Embeddings Search
Embeddings represent data as vectors that can be used for machine learning tasks such as search, clustering, and classification.
Nearest Neighbor Search
Nearest neighbor search is a common use case for embeddings. It lets you sort data by similarity to a query, which is useful for search and retrieval tasks where you can't rely on exact string matches.
For example, say you want to find all the rows that have a title similar to "Wild yak".
This is an advanced command that we will break down step by step in the next sections. For now, let's look at the results.
If you look in the title column, you will see that all of the top results have something in common with wild animals. The fourth result is Musk ox, which is another large bovine animal. In the similarity column, you will see a score between 0 and 1 that indicates how similar the two vectors are.
How it works
Embeddings are an abstraction of the data, represented as vectors of floating point numbers. You can perform efficient nearest neighbor searches on these vectors to see which vectors are closest to your query.
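To make the idea concrete, here is a minimal sketch of a nearest neighbor search in plain Python. It assumes cosine similarity as the distance measure, which this page does not confirm, and it uses made-up 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions. The function names are hypothetical and are not part of Oxen.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, candidates, top_k=3):
    """Return (title, similarity) pairs, most similar first."""
    scored = [
        (title, cosine_similarity(query, vector))
        for title, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "Wild yak": [0.9, 0.1, 0.0],
    "Musk ox": [0.8, 0.2, 0.1],
    "Jazz piano": [0.0, 0.1, 0.9],
}

results = nearest_neighbors(toy_embeddings["Wild yak"], toy_embeddings)
for title, similarity in results:
    print(f"{title}: {similarity:.3f}")
```

This brute-force version compares the query against every vector. A vector index does the same job without scanning every row.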
To give you an idea of how this works, let's break down each of the parameters in the command above.
--find-embedding-where
The first step is to pick a row that we want to find similar rows to. We want to pluck out the embedding for this row so that we can compare all of the other rows to it.
This parameter is simply the SQL WHERE clause that we use to filter the rows that we want to find similar rows to.
You can use any SQL WHERE clause here; filtering on an id or primary key is a good idea. If multiple rows match the query, their embeddings will be averaged together.
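The averaging step can be sketched as follows. This assumes "averaged together" means an element-wise mean over the matched embeddings, which is the usual reading but is not spelled out here. The function name is hypothetical.

```python
def average_embeddings(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("no rows matched the filter")
    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise ValueError("all embeddings must have the same length")
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Two made-up embeddings from rows that matched the WHERE clause.
matched = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
query_vector = average_embeddings(matched)
print(query_vector)
```

The averaged vector then acts as a single query, so a filter that matches several related rows searches for items near their shared center.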
--sort-by-similarity-to
In the above example, you will see that we need to specify the column that contains the embeddings. This parameter tells oxen which column to grab the embeddings from as well as the column to sort by.
For example, if we want to sort on the title_embeddings column, the underlying SQL that is generated will look like this.
--workspace-id
The workspace is where the embeddings are stored. It contains the vector index that is used for the nearest neighbor search.
Putting it all together
If you don't already have a dataset with embeddings, either compute them yourself or download one of our example datasets.
Grab these precomputed embeddings with the following command.
Create a workspace
In order to use embeddings, you will need to create a workspace. Workspaces allow you to query and edit versions of the data without immediately committing your changes. Oxen uses DuckDB to store your embeddings and data.
If you haven't already created an Oxen repository, you should create a new one to get started.
Now we have our embeddings committed to the repository. We can create a workspace to query the data. A workspace is based on a branch and links directly to a version of a dataset at a commit. If you want to learn more about workspaces, check out the workspaces page.
Create a workspace and give it a name.
To see which workspaces have been created, you can list them.
Index your embeddings
Once you have a workspace, you can index any csv, parquet, or jsonl file into DuckDB. If the file contains embeddings, you can specify the name of the column that holds them.
Note: oxen df index without the --embeddings flag will just index the data into DuckDB so that you can query it with SQL, but will not enable nearest neighbor search. When you pass in the --embeddings flag, oxen will automatically run the following SQL commands to enable nearest neighbor search.
Query embeddings
Now that the embeddings have been indexed, you can query them.
This will string together all the underlying SQL queries and do the heavy lifting to give you a set of sorted results.
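As a rough mental model of those steps, the sketch below strings them together in plain Python: filter rows, average the matched embeddings, score every row, and sort. It is not Oxen's implementation. The function find_similar is hypothetical, a Python predicate stands in for the SQL WHERE clause, cosine similarity is an assumed metric, and the rows are made up.

```python
import math


def find_similar(rows, where, embedding_column):
    """Rank rows by similarity to the rows selected by `where`."""
    # Step 1: collect the embeddings of the rows that match the filter.
    matched = [row[embedding_column] for row in rows if where(row)]
    if not matched:
        raise ValueError("filter matched no rows")

    # Step 2: average the matched embeddings into one query vector.
    query = [sum(component) / len(matched) for component in zip(*matched)]
    query_norm = math.sqrt(sum(x * x for x in query))

    # Step 3: score each row against the query (cosine similarity assumed).
    def similarity(vector):
        dot = sum(x * y for x, y in zip(query, vector))
        return dot / (query_norm * math.sqrt(sum(y * y for y in vector)))

    scored = [dict(row, similarity=similarity(row[embedding_column])) for row in rows]

    # Step 4: most similar rows first.
    return sorted(scored, key=lambda row: row["similarity"], reverse=True)


# Toy rows with 2-dimensional embeddings, invented for illustration.
rows = [
    {"id": 1, "title": "Wild yak", "title_embeddings": [1.0, 0.0]},
    {"id": 2, "title": "Jazz piano", "title_embeddings": [0.0, 1.0]},
    {"id": 3, "title": "Musk ox", "title_embeddings": [0.8, 0.6]},
]

ranked = find_similar(
    rows,
    where=lambda row: row["id"] == 1,
    embedding_column="title_embeddings",
)
titles = [row["title"] for row in ranked]
print(titles)
```

Here the where argument plays the role of --find-embedding-where and embedding_column plays the role of --sort-by-similarity-to.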
Query with SQL
Now that the data is indexed into a workspace, you can also query the data with raw SQL.
Workspaces are power tools once you wrap your head around them. They allow you to build some really interesting exploratory data analysis, labeling workflows, and search pipelines. Using nearest neighbor search with embeddings is a great way to sift through large datasets, prototype RAG pipelines, and test different embedding models.
If you want to see the underlying HTTP request that is being made, check out the API reference.