Extracting Text Embeddings and Nearest Neighbors Search

Modern Natural Language Processing (NLP) techniques often require transforming textual data into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the text. Once in this form, various machine learning algorithms can process the data, and one of the most common tasks is finding texts that are semantically similar.

In this chapter, we'll explore how to use the sentence_transformers library to generate embeddings for chunks of text. After that, we'll delve into how you can use scikit-learn's Nearest Neighbors module to find similar text chunks based on their embeddings.

1. Extracting Embeddings using Sentence Transformers

The `sentence_transformers` library (install it with pip install sentence-transformers) provides an easy and efficient way to convert sentences into embeddings.

Setting up:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Here, we've initialized a Sentence Transformer model called all-MiniLM-L6-v2. There are various models available, and you can choose the one that best suits your requirements.
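Before committing to a model, you can check the dimensionality of the vectors it produces. A minimal sketch (all-mpnet-base-v2 is just one example of an alternative model from the sentence-transformers model hub):

print(model.get_sentence_embedding_dimension())  # 384 for all-MiniLM-L6-v2

# A larger, slower alternative that produces 768-dimensional embeddings:
# model = SentenceTransformer('all-mpnet-base-v2')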

Embedding Extraction:

embeddings = model.encode(chunk_text)

The encode method takes a chunk of text and returns its embedding. This embedding can be saved for future use.
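Note that encode also accepts a list of strings and returns one vector per string, which is much faster than encoding in a loop. A small sketch (the chunk strings are placeholders):

chunks = ["first chunk of text", "second chunk of text"]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) with all-MiniLM-L6-v2: one vector per chunk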

2. Using Scikit-Learn's Nearest Neighbors for Similarity Search

Before diving into Nearest Neighbors, let's understand its purpose.

Given a chunk of text, you might want to find other text chunks that are semantically similar to it. This is where the Nearest Neighbors algorithm comes into play.
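To make "semantically similar" concrete: each embedding is a vector, and the similarity between two texts can be measured as the cosine similarity of their vectors. A quick sketch using scikit-learn (the two sentences are made-up examples):

from sklearn.metrics.pairwise import cosine_similarity

a = model.encode("The cat sat on the mat")
b = model.encode("A kitten was resting on the rug")
print(cosine_similarity([a], [b])[0][0])  # higher values indicate more similar meaning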

Setting up:

To use the Nearest Neighbors module, first install scikit-learn (pip install scikit-learn), then import it:

from sklearn.neighbors import NearestNeighbors


# Assuming embeddings_list holds (chunk_id, chunk_text, embedding)
# tuples, as built in the full example below
X = [item[2] for item in embeddings_list]

nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
# Here, the model will return the five most similar text chunks when queried.
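One caveat: with algorithm='ball_tree', distances are Euclidean by default. If you would rather rank neighbors by cosine distance, which is common for sentence embeddings, one option is brute-force search (a sketch; ball_tree itself does not support the cosine metric):

nbrs = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X)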


# To find text chunks similar to a given chunk:


distances, indices = nbrs.kneighbors([model.encode("Your input text here")])
# This will give you the indices of the text chunks that are most similar to the input text.
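The returned indices index into X, and therefore into embeddings_list, so you can pair each neighbor with its distance (assuming the (chunk_id, chunk_text, embedding) layout used below):

for dist, idx in zip(distances[0], indices[0]):
    print(dist, embeddings_list[idx][1])  # distance and the matching chunk text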

Putting it all together:


import sqlite3
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# 1. Read the text chunks from the SQLite database

# Connect to the SQLite database
conn = sqlite3.connect('../chunks.db')
cursor = conn.cursor()

# Select all rows from the pdf_chunks table
# We will sample the first 100 for the class
cursor.execute("SELECT chunk_id, chunk_text FROM pdf_chunks LIMIT 100")
rows = cursor.fetchall()

conn.close()

# 2. Convert these chunks into embeddings

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings_list = []

for row in rows:
    chunk_id = row[0]
    chunk_text = row[1]
    embedding = model.encode(str(chunk_text))
    embeddings_list.append((chunk_id, chunk_text, embedding))

# For the sake of the example, let's save the embeddings to a file (optional)
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_list, f)
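
# To reload the saved embeddings in a later session:
# with open('embeddings.pkl', 'rb') as f:
#     embeddings_list = pickle.load(f)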

# 3. Use scikit-learn's Nearest Neighbors to search for similar chunks

# First build the model
X = [item[2] for item in embeddings_list]
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)

# Function to search for similar text chunks
def find_similar_chunks(input_text):
    distances, indices = nbrs.kneighbors([model.encode(input_text)])
    return [embeddings_list[i] for i in indices[0]]
    

# Test
input_chunk = "prompting techniques"
similar_chunks = find_similar_chunks(input_chunk)
print("Text chunks similar to:", input_chunk)
for chunk in similar_chunks:
    print(chunk[1])
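
As a side note, the per-row loop above could be replaced with a single batched call to encode, which is typically faster since the model batches internally; a sketch producing the same tuple layout:

texts = [str(row[1]) for row in rows]
all_embeddings = model.encode(texts)  # one call for all chunks
embeddings_list = [(row[0], row[1], emb) for row, emb in zip(rows, all_embeddings)]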

Conclusion

Text embeddings have revolutionized the way we handle and process textual data.

By converting text into numerical form, not only can we better understand the semantic meaning behind words and sentences, but we can also apply machine learning algorithms to derive insights, group similar content, and more.