Nearest Neighbor Search
Nearest neighbor search is a common use case for embeddings. This allows you to sort data by similarity to a query. This is useful for search and retrieval tasks where you can’t rely on exact string matches. For example, say you want to find all the rows that have a similar title to “Wild yak”.
title
column, you will see all of the top results have something in common with wild animals. The forth result is Musk ox, which is another large bovine animal 🐂. In the similarity
column, you will see a score between 0 and 1 that indicates how similar the two vectors are.
How it works
Embeddings are an abstraction of the data, represented as vectors of floating point numbers. You can perform efficient nearest neighbor searches on these vectors to see which vectors are closest to your query. To give you an idea of how this works, let’s break down each one of the parameters in the command above.—find-embedding-where
The first step is to pick a row that we want to find similar rows to. We want to pluck out the embedding for this row so that we can compare all of the other rows to it. This parameter is simply the SQLWHERE
clause that we use to filter the rows that we want to find similar rows to.
WHERE
clause here, using an id or primary key is a good idea. If there are multiple rows that match the query, the embeddings will be averaged together.
—sort-by-similarity-to
In the above example, you will see that we need to specify the column that contains the embeddings. This parameter tells oxen which column to grab the embeddings from as well as the column to sort by. For example, if we want to sort on thetitle_embeddings
column the underlying SQL that is generated will look like this.
—workspace-id
The workspace is where the embeddings are stored. It contains the vector index that is used for the nearest neighbor search.Putting it all together
If you don’t already have a dataset with embeddings, either compute them yourself or download one of our example datasets.
Create a workspace
In order to use embeddings, you will need to create a workspace. Workspaces allow you to query and edit versions of the data without immediately committing your changes. Oxen uses DuckDB to store your embeddings and data. If you haven’t already created an Oxen repository, you should create a new one to get started.Index your embeddings
Once you have a workspace, you can then index anycsv
, parquet
, or jsonl
file into DuckDB. If the file contains embeddings, you can specify the column name with the embeddings.
oxen df index
without the --embeddings
flag will just index the data into DuckDB so that you can query it with SQL, but will not enable nearest neighbor search. When you pass in the --embeddings
flag, oxen will automatically run the following SQL commands to enable nearest neighbor search.
Query embeddings
Now that the embeddings have been indexed, you can query them.