Retrieval-Augmented Generation using Language Models

The previous chapters equipped you with the skills to extract, chunk, and embed information from diverse PDF sources.

The essence of AI-driven solutions lies not just in retrieving data but, more importantly, in interpreting it effectively. This is where large language models (LLMs), like OpenAI's GPT variants, come into play.

The Power of Retrieval-Augmented Generation (RAG)

Before diving into the code, it's crucial to grasp the paradigm of Retrieval-Augmented Generation (RAG). RAG seamlessly combines the strengths of two worlds:

  • Retrieval Systems: Systems such as Annoy that efficiently retrieve relevant chunks of data based on vector similarity.

  • Generative Language Models (LLMs): Sophisticated models, such as GPT-3.5, that generate human-like text from a given prompt.

When united, the retrieval system pulls the most relevant data and the generative LLM interprets it, producing insightful, grounded responses. This combination is potent for a wide range of applications, such as question answering and document summarization.
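Conceptually, the flow can be sketched in a few lines of Python. This is only an illustrative outline: retrieve_chunks and generate_answer are hypothetical placeholders standing in for the Annoy lookup and the OpenAI call that we build step by step below.

def answer_with_rag(query, retrieve_chunks, generate_answer, top_k=10):
    # 1. Retrieval: find the chunks most similar to the query
    context_chunks = retrieve_chunks(query, top_k)
    # 2. Augmentation: pack the retrieved chunks into a prompt alongside the query
    prompt = "Context:\n" + "\n".join(context_chunks) + "\n\nQuery: " + query + "\nAnswer:"
    # 3. Generation: ask the LLM to answer grounded in the retrieved context
    return generate_answer(prompt)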

Leveraging LLMs for RAG: A Deep Dive

The provided code walks you through how to combine the power of retrieval systems with LLMs for a comprehensive solution:

1. Setting up Dependencies

from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer
import openai
from string import Template

openai.api_key = "YOUR_API_KEY"

Here, we import all the necessary libraries. Notice AnnoyIndex for retrieval, SentenceTransformer for embeddings, and openai for calling the LLM.

2. Loading the Sentence Transformer Model


model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

The sentence transformer model is loaded to generate embeddings for any new queries we might want to make.
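As a quick sanity check (an optional sketch, not part of the original listing), you can confirm that the model's embedding dimension matches the dimension the Annoy index was built with:

# all-MiniLM-L6-v2 produces 384-dimensional embeddings; the Annoy index must use the same dimension
assert model.get_sentence_embedding_dimension() == VEC_INDEX_DIM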

3. Querying the Vector Index


u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/path_to_vector_index/vecindex.ann")

query_text = "your_query_here"
embedding = model.encode([query_text])
input_vec = embedding[0]

chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)

Here, we load the pre-constructed Annoy index and fetch the IDs of the 10 text chunks most similar to our query_text, based on vector similarity.
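If you also want to inspect how close each match is, Annoy can return the angular distances alongside the IDs (an optional sketch):

# include_distances=True returns a (ids, distances) pair; smaller angular distance means higher similarity
chunk_ids, distances = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=True)
for chunk_id, dist in zip(chunk_ids, distances):
    print(chunk_id, round(dist, 3))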

4. Fetching Relevant Text Chunks


con = sqlite3.connect("/path_to_database/chunks.db")
cur = con.cursor()

list_chunk_ids = ','.join([str(k) for k in chunk_ids])
cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
res = cur.fetchall()
res_docs = '\n'.join([k[1] for k in res])

This code establishes a connection to the SQLite database, where the chunks are stored, and retrieves the actual text based on their IDs.
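The concatenated SQL above works because the IDs returned by Annoy are plain integers, but a safer and more idiomatic variant uses SQLite placeholders (a small sketch of the same query):

# Build one "?" placeholder per chunk ID and let sqlite3 bind the values
placeholders = ','.join('?' * len(chunk_ids))
cur.execute(
    "select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + placeholders + ")",
    chunk_ids,
)
res = cur.fetchall()
res_docs = '\n'.join([row[1] for row in res])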

5. Engaging LLM for Interpretation

DEFAULT_TEXT_QA_PROMPT_TMPL = Template(
    "Context information is below.\n"
    "---------------------\n"
    "$context_str\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: $query_str\n"
    "Answer: "
)

send_to_llm = DEFAULT_TEXT_QA_PROMPT_TMPL.substitute(context_str=res_docs, query_str=query_text)

completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": send_to_llm}])
print(completion.choices[0].message.content)

We use a template-based approach to send a structured prompt to GPT-3.5: the retrieved text chunks are inserted as context, and the model is asked to answer the query using only that context.
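Note that this listing uses the pre-1.0 openai package. If you are running openai 1.0 or newer, the equivalent call goes through a client object instead; a minimal sketch:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": send_to_llm}],
)
print(completion.choices[0].message.content)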

The full code is provided below.


# Calling LLM with the retrieved chunks

# Now query the Vector Index created earlier with Annoy and fetch the chunks from SQLite
from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer
import openai
from string import Template

openai.api_key = "sk-..."


# Load the Sentence Transformer Model
# We'll use this model to convert our query text into a vector embedding.

model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/kaggle/working/vecindex.ann")

# Generate the Embedding for the Query Text
# Let's convert the query text into an embedding using the Sentence Transformer model


query_text = "what is zero-shot prompting"
embedding = model.encode([query_text])
input_vec = embedding[0]


# Retrieve the IDs of the top 10 most similar text chunks based on the query text's embedding:

chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)
print(chunk_ids)

# Retrieve the Actual Text Chunks from the SQLite Database
# First, establish a connection to the SQLite database.


con = sqlite3.connect("/kaggle/working/chunks.db")
cur = con.cursor()

# Then, retrieve the text of the matching chunks by their IDs:


list_chunk_ids = ','.join([str(k) for k in chunk_ids])
cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
res = cur.fetchall()
res_docs = '\n'.join([k[1] for k in res])

# Construct a template that will get filled with the results fetched and then sent to OpenAI 
# Template used from https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/default_prompts.py
DEFAULT_TEXT_QA_PROMPT_TMPL = Template(
    "Context information is below.\n"
    "---------------------\n"
    "$context_str\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: $query_str\n"
    "Answer: "
)

send_to_llm = DEFAULT_TEXT_QA_PROMPT_TMPL.substitute(context_str=res_docs, query_str=query_text)
        
completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": send_to_llm}])
print(completion.choices[0].message.content)    


Conclusion

The synergy between efficient retrieval systems and advanced LLMs makes Retrieval-Augmented Generation a compelling strategy for AI application development. By understanding and combining the steps detailed in this chapter and the previous ones, engineering students can now build end-to-end solutions for modern AI applications that deliver value in real-world scenarios.