Compute Text Embeddings
How to compute vector embeddings for a text dataset on a GPU.
Embeddings represent text as numerical vectors. They are used in a variety of applications, including search and retrieval, clustering, data labeling, and anomaly detection.
Notebooks make it easy and fast to compute embeddings for a dataset on a GPU. If you want to follow along, you can check out this notebook and run it in your own Oxen.ai account. When running this example, try an A10 GPU with 4GB of memory and 4 CPU cores. This will allow us to compute over 1,000 embeddings per second 🔥
Setting Up The Interface
Marimo allows you to define UI elements for choosing the input repository, dataset, model name, and number of rows to compute embeddings for. First, let's set up a simple form that kicks off the embedding computation.
Use the following code in your first cell to set up the UI.
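Here is a sketch of what that cell might look like, assuming a marimo form built with mo.md(...).batch(...).form(); the field names, defaults, and example repository are illustrative assumptions, not the exact values from the notebook:

```python
import marimo as mo

# A sketch of the input form; field names and defaults are illustrative.
run_form = (
    mo.md(
        """
        **Compute Embeddings**

        {repo} {path} {model} {rows}
        """
    )
    .batch(
        repo=mo.ui.text(value="ox/Tweets", label="Repository"),
        path=mo.ui.text(value="data.parquet", label="File Path"),
        model=mo.ui.text(value="BAAI/bge-large-en-v1.5", label="Model"),
        rows=mo.ui.number(start=100, stop=1_000_000, value=10_000, label="Rows"),
    )
    .form(submit_button_label="Compute Embeddings")
)
run_form
```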
To wait for the button to be clicked, use the mo.stop function and check if run_form.value is None.
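For example, in the next cell (the output message and variable names are illustrative):

```python
# Halt this cell (and any cells that depend on it) until the form has been
# submitted; a marimo form's value is None before submission.
mo.stop(run_form.value is None, mo.md("Submit the form to compute embeddings."))

# Pull the submitted values out of the form (keys match the batch above).
repo_name = run_form.value["repo"]
file_path = run_form.value["path"]
model_name = run_form.value["model"]
num_rows = int(run_form.value["rows"])
```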
Then download the data using the values from the form and the RemoteRepo class.
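A sketch of that step, assuming the oxen Python package's RemoteRepo.download method and a parquet file (the file format is an assumption):

```python
from oxen import RemoteRepo
import pandas as pd

# Download the dataset file from the Oxen.ai repository named in the form.
repo = RemoteRepo(repo_name)
repo.download(file_path)

# Load it into a data frame and trim to the requested number of rows.
df = pd.read_parquet(file_path)
df = df.head(num_rows)
```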
Compute Embeddings
This example will use the sentence_transformers library to compute the embeddings, with BAAI/bge-large-en-v1.5 as the default model. Find more information about the model here.
Now we can compute the embeddings for the dataset. We will compute them in batches to take full advantage of the GPU. In this example, we are only computing embeddings for the title column, but you can compute embeddings for any text column in the dataset. mo.status.progress_bar is used to show a progress bar in the UI as the embeddings are computed.
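Here is a sketch of that loop, assuming the model name and data frame from the earlier cells; the batch size is an illustrative value to tune for your GPU:

```python
from sentence_transformers import SentenceTransformer

# Load the model onto the GPU.
model = SentenceTransformer(model_name, device="cuda")

batch_size = 128  # assumption: adjust to fit your GPU memory
titles = df["title"].tolist()

# Encode in batches, with a marimo progress bar tracking the loop.
embeddings = []
for i in mo.status.progress_bar(range(0, len(titles), batch_size)):
    batch = titles[i : i + batch_size]
    embeddings.extend(model.encode(batch))

result_df = df.copy()
result_df["embedding"] = embeddings
```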
You should see the model computing over 1,000 embeddings per second 🔥
The embeddings will now be in the result_df data frame, in a new column called embedding.
Save the Embeddings
Once you have computed the embeddings, save them to your Oxen.ai repository to share with your team. Oxen.ai will version the embeddings and allow you to track changes so that you can try out different models and configurations without worrying about losing your previous work.
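One way to do this, assuming the RemoteRepo add and commit methods from the oxen package (the output file name is illustrative):

```python
# Write the embeddings to a local parquet file, then commit it to the repo.
output_path = "titles_with_embeddings.parquet"
result_df.to_parquet(output_path)

repo.add(output_path)
repo.commit(f"Add embeddings computed with {model_name}")
```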
Search Nearest Neighbors
To check how well the embeddings encode the text, let's build a little search tool. We will use cosine_similarity from sklearn to build a simple nearest neighbor search.
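A minimal sketch of such a function, assuming the model and result_df from the previous cells (the parameter k is a name chosen for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def embedding_similarity(query: str, k: int = 5):
    # Embed the query with the same model used for the dataset.
    query_embedding = model.encode([query])

    # Stack the stored embeddings into a matrix and score every row.
    matrix = np.vstack(result_df["embedding"].to_numpy())
    scores = cosine_similarity(query_embedding, matrix)[0]

    # Return the k most similar rows, highest score first.
    top_k = np.argsort(-scores)[:k]
    return result_df.iloc[top_k].assign(similarity=scores[top_k])
```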
Now we can use the embedding_similarity function to search for the nearest neighbors of a query.
Build a text input so that we can enter any term we want and see similar titles.
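For example (the default query and the split across two cells are illustrative):

```python
# Cell 1: a text input for the search term.
query = mo.ui.text(label="Search", value="machine learning")
query
```

```python
# Cell 2: re-runs automatically whenever the input above changes.
mo.stop(query.value == "", mo.md("Enter a search term above."))
embedding_similarity(query.value)
```

Because marimo cells are reactive, the search results update every time the text input changes.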