Embeddings with sqlite-vector

About a year and a half ago I wrote about using sqlite-vss to store and query embedding vectors in a SQLite database. Much has changed since then and I’m working on a project that motivated another pass at querying embeddings on a local system for smallish datasets. The sqlite-vector project seemed like an interesting one to try for this purpose.

I am going to use the same news dataset as last time and the nomic-embed-text-v1.5 model to generate 768-dimensional embeddings.

I also downloaded the vector.dylib file from the sqlite-vector GitHub repo and placed it in my working directory for this example. I’ve tried exercises similar to this one with both the macOS and Linux versions of the library.
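If you're following along on Linux, the only difference should be the extension binary you load. Here is a small sketch (the Linux filename is my assumption; use whatever file you actually downloaded from the repo):

# Choose the extension binary for the current platform
# (the Linux filename is an assumption; match it to the file you downloaded)
import sys

ext_path = './vector.dylib' if sys.platform == 'darwin' else './vector.so'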

To get started, we’ll install the libraries we will need to load the data, create the database and generate embeddings.

pip install -q pandas scikit-learn sentence-transformers

Next, we’ll create the database and table and load the sqlite-vector extension, verifying that it loaded correctly.

import sqlite3

conn = sqlite3.connect('news.db')
conn.enable_load_extension(True)
conn.load_extension('./vector.dylib')
conn.enable_load_extension(False)

version = conn.execute("SELECT vector_version()").fetchone()[0]
print(f"SQLite Vector extension version: {version}")
SQLite Vector extension version: 0.9.37

Looks good! The query shows the same version I downloaded.

If you haven’t taken a look at the dataset yet, here’s a sample:

# Load the dataset to see the structure
import json
from pathlib import Path

import pandas as pd

data_path = Path('data/News_Category_Dataset_v3.json')
articles = []
with open(data_path, 'r') as f:
    for i, line in enumerate(f):
        articles.append(json.loads(line))

df = pd.DataFrame(articles)
df.head()

From here, we’ll use the same database connection to create a table to store the news dataset as well as embeddings for both the short_description and headline fields of the dataset.

I’m only using the first 1000 entries because it takes a few minutes to compute all the embeddings.

# Create the table, with BLOB columns to hold the embeddings
conn.execute("""
    CREATE TABLE news_articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        link TEXT,
        headline TEXT,
        category TEXT,
        short_description TEXT,
        authors TEXT,
        date TEXT,
        headline_embedding BLOB,
        description_embedding BLOB
    )
""")

# Insert the first 1,000 articles
for _, article in df.head(1000).iterrows():
    conn.execute(
        """
        INSERT INTO news_articles (link, headline, category, short_description, authors, date)
        VALUES (?, ?, ?, ?, ?, ?)
        """,
        (
            article['link'],
            article['headline'],
            article['category'],
            article['short_description'],
            article['authors'],
            article['date'],
        ),
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM news_articles").fetchone()[0]
print(f"Loaded {count:,} articles")

print("\nSample articles:")
for row in conn.execute("SELECT id, headline, category FROM news_articles LIMIT 5"):
    print(f"  [{row[2]}] {row[1][:70]}...")
Loaded 1,000 articles

Sample articles:
  [U.S. NEWS] Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Bo...
  [U.S. NEWS] American Airlines Flyer Charged, Banned For Life After Punching Flight...
  [COMEDY] 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)...
  [PARENTING] The Funniest Tweets From Parents This Week (Sept. 17-23)...
  [U.S. NEWS] Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-E...

With the source data loaded, we'll now generate the embeddings for the headline and short_description fields using the nomic-embed-text-v1.5 model, and store those in the database as well.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
print(f"Model loaded (embedding dimension: {model.get_sentence_embedding_dimension()})")
Model loaded (embedding dimension: 768)
articles = conn.execute("""
    SELECT id, headline, short_description
    FROM news_articles
""").fetchall()

batch_size = 32  # batch size for encoding; a reasonable default

for i in range(0, len(articles), batch_size):
    batch = articles[i:i + batch_size]
    article_ids = [article[0] for article in batch]
    headlines = [article[1] for article in batch]
    descriptions = [article[2] if article[2] else "" for article in batch]

    headline_embeddings = model.encode(headlines, normalize_embeddings=True)
    description_embeddings = model.encode(descriptions, normalize_embeddings=True)

    # Store in database as BLOB
    for article_id, h_emb, d_emb in zip(article_ids, headline_embeddings, description_embeddings):
        h_blob = h_emb.tobytes()
        d_blob = d_emb.tobytes()
        conn.execute(
            "UPDATE news_articles SET headline_embedding = ?, description_embedding = ? WHERE id = ?",
            (h_blob, d_blob, article_id)
        )

    # Commit and show progress every 4000 articles (helpful if you actually load all the data)
    if (i + batch_size) % 4000 == 0 or (i + batch_size) >= len(articles):
        conn.commit()
        print(f"Processed {min(i + batch_size, len(articles)):,}/{len(articles):,} articles")
Processed 1,000/1,000 articles

With the embeddings generated and inserted into the database, we'll now initialize the vector search for both the headline and short_description fields. According to the documentation:

Only tables explicitly initialized using vector_init are eligible for vector search.

So let’s do that:

conn.execute("SELECT vector_init('news_articles', 'headline_embedding', 'type=FLOAT32,dimension=768,distance=COSINE')")
conn.execute("SELECT vector_init('news_articles', 'description_embedding', 'type=FLOAT32,dimension=768,distance=COSINE')")

We could also quantize the vectors for faster and more memory-efficient search, but let's save that for another time.
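To give a sense of what quantization buys us, here's a rough illustration in plain NumPy (my own sketch, not the sqlite-vector API): scalar quantization maps each float32 value to an int8, shrinking every vector to a quarter of its size at the cost of some precision.

import numpy as np

def quantize_int8(vec):
    # Scale the float32 vector into the int8 range and remember the scale
    scale = np.abs(vec).max() / 127.0
    return (vec / scale).astype(np.int8), scale

def dequantize_int8(qvec, scale):
    # Approximate reconstruction of the original float32 vector
    return qvec.astype(np.float32) * scale

vec = model.encode(["hello world!"], normalize_embeddings=True)[0]
qvec, scale = quantize_int8(vec)
print(vec.nbytes, qvec.nbytes)  # 3072 bytes -> 768 bytes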

With our embeddings generated and initialized, we can now define the search functions over the two fields. Before we do though, we’ll briefly discuss how the search functions work. We start by taking query_text and encoding it using the same model we used to generate the embeddings.

encoded = model.encode(["hello world!"])[0]
encoded[:20]  # truncated for brevity
array([ 0.12390442, -0.06160246, -3.9840953 , -0.24632797, -0.3041943 , 1.5918531 , -0.20516105, -0.6006584 , -0.22573042, -1.319676 , 0.31299376, 1.2585417 , 0.49875298, 1.381727 , 0.5602437 , -1.0751574 , 0.27402204, -1.1879947 , -0.98643625, 0.46151334],

Specifically, the size of the vector equals the embedding dimension of the model (768 here).
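A quick check (my addition) makes that concrete:

# One element per embedding dimension
assert encoded.shape[0] == model.get_sentence_embedding_dimension() == 768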

Given this approach, let’s show how we can calculate the distance of a query from one of the headlines in the dataset. As we specified in vector_init, we will use cosine distance (distance=COSINE).

from sklearn.metrics.pairwise import cosine_similarity

# `query` holds the search string used throughout this example (definition not shown)
headline = "New York City To Let Unvaccinated Athletes Play Home Games"

query_embedding = model.encode([query], normalize_embeddings=True)
headline_embedding = model.encode([headline], normalize_embeddings=True)
similarity = cosine_similarity(query_embedding, headline_embedding)
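sqlite-vector reports cosine distance rather than similarity. Since cosine distance is simply one minus cosine similarity, a small added step (mine, for clarity) lets us compare this value directly with the distances the extension returns later:

# Cosine distance is 1 minus cosine similarity
distance = 1 - similarity[0][0]
print(f"Cosine distance: {distance:.4f}")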

Back to analyzing our dataset. Since I normalized the embeddings when I encoded them above, we'll also need to normalize the query embedding here. Finally, we'll convert the output array to a blob (the data type of our embedding columns) and pass it to the vector_full_scan function.

f"{model.encode(["hello world!"], normalize_embeddings=True).tobytes()[:20]}..."
"b'\\r-\\xb4;\\xa8(3\\xbb\\xdf\\x0b5\\xbej\\x193\\xbc*,]\\xbc'..."
def search_by_headline(conn, query_text, top_k=5):
    query_embedding = model.encode([query_text], normalize_embeddings=True)
    query_blob = query_embedding.tobytes()
    results = conn.execute("""
        SELECT a.id, a.headline, a.category, a.short_description, v.distance
        FROM news_articles AS a
        JOIN vector_full_scan('news_articles', 'headline_embedding', ?, ?) AS v
          ON a.id = v.rowid
    """, (query_blob, top_k)).fetchall()
    return results


def search_by_description(conn, query_text, top_k=5):
    query_embedding = model.encode([query_text], normalize_embeddings=True)
    query_blob = query_embedding.tobytes()
    results = conn.execute("""
        SELECT a.id, a.headline, a.category, a.short_description, v.distance
        FROM news_articles AS a
        JOIN vector_full_scan('news_articles', 'description_embedding', ?, ?) AS v
          ON a.id = v.rowid
    """, (query_blob, top_k)).fetchall()
    return results

Let’s run some semantic queries and see what we find! This should look familiar — we’re using the same query we embedded and calculated cosine distance for above.

headline_results = search_by_headline(conn, query, top_k=3)
for i, (article_id, headline, category, description, distance) in enumerate(headline_results, 1):
    print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")
1. [U.S. NEWS] New York City To Let Unvaccinated Athletes Play Home Games (distance: 0.3429)
2. [SPORTS] Video Shows NFL Player Hopping Into Stands To Stop Fighting Fans (distance: 0.3828)
3. [POLITICS] NBA Won't Host Election Day Games In Effort To Increase Voter Turnout (distance: 0.4232)

The first (and closest) headline match shows the same distance we calculated above (0.3429).

Now we can run a similar query on the description field:

description_results = search_by_description(conn, query, top_k=3)
for i, (article_id, headline, category, description, distance) in enumerate(description_results, 1):
    print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")
    print(f"   Description: {description}")
1. [SPORTS] Las Vegas Aces Win First WNBA Title, Chelsea Gray Named MVP (distance: 0.3580)
   Description: Las Vegas never had a professional sports champion — until Sunday.
2. [SPORTS] Lucius Fox Makes Sickening Exit Just 2 Pitches Into Game (distance: 0.3966)
   Description: The Washington Nationals infielder became a hurler for one gross moment — which was caught on video — against the San Francisco Giants.
3. [SPORTS] Diego Maradona's 'Hand Of God' Jersey Sells For Record Sum (distance: 0.4160)
   Description: The Argentine soccer superstar wore the shirt when he scored the controversial goal against England in the 1986 World Cup.
headline_results = search_by_headline(conn, "artificial intelligence", top_k=3)description_results = search_by_description(conn, "artificial intelligence", top_k=3)for i, (article_id, headline, category, description, distance) in enumerate(headline_results, 1): print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")for i, (article_id, headline, category, description, distance) in enumerate(description_results, 1): print(f"{i}. [{category}] {description} (distance: {distance:.4f})")
1. [TECH] Google Engineer On Leave After He Claims AI Program Has Gone Sentient (distance: 0.3774)
2. [CULTURE & ARTS] How Ani Liu Is Brilliantly Disguising Her Art As Science (distance: 0.5151)
3. [ENTERTAINMENT] Andrew Garfield Confirms Method Acting Is Possible Without 'Being An Asshole' (distance: 0.5217)

1. [TECH] Artificially intelligent chatbot generator LaMDA wants “to be acknowledged as an employee of Google rather than as property," says engineer Blake Lemoine. (distance: 0.4230)
2. [POLITICS] The House Intelligence Committee hearing follows a 2021 report of a possible national security "challenge" from UFOs. (distance: 0.4950)
3. [U.S. NEWS] The American Academy of Pediatrics says it is putting all its guidance under the microscope to eliminate “race-based” medicine and resulting health disparities. (distance: 0.4967)

Embeddings and cosine distance let us search over the headline and short_description fields of the dataset, surfacing semantically similar results even when the matches don't contain the query text (or parts of it) exactly.
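For contrast, here is a quick sketch of my own (not part of the original walkthrough) of a plain keyword search over the same table; a literal LIKE match can only surface rows that contain the query text itself:

# Naive substring search for comparison with the embedding-based search
def search_by_keyword(conn, query_text, top_k=5):
    return conn.execute(
        "SELECT id, headline, category FROM news_articles "
        "WHERE headline LIKE ? OR short_description LIKE ? LIMIT ?",
        (f"%{query_text}%", f"%{query_text}%", top_k),
    ).fetchall()

# Phrasings that never appear verbatim in the articles will likely return nothing here,
# while the embedding search above still finds related stories.
print(search_by_keyword(conn, "artificial intelligence", top_k=3))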
