AI Product Development from 0 to 1

In an age where Artificial Intelligence is more than just a buzzword, mastery of its many facets is not only beneficial but crucial for the engineers of tomorrow.

The material here offers a comprehensive and hands-on introduction to the many components of AI product development, a journey that spans from the rudiments of processing data in PDFs to the intricate art of Retrieval Augmented Generation using state-of-the-art models like GPT-3.5.

This book has been meticulously crafted for both novices and professionals, unraveling Vector Embeddings, the efficiencies of Nearest Neighbors, the scalability brought in by libraries like Annoy, and the robustness of web services via FastAPI.

Furthermore, you will learn the approach behind Docker deployments, ensuring your AI solutions are portable and available to anyone online.

Whether you're an AI enthusiast, a budding engineer, or a seasoned developer, "AI Unveiled" promises to be an enlightening companion on your quest for AI excellence.

Welcome aboard this transformative journey!

Harsh Singhal


Content

Learn how to process and extract content from PDF documents.

Processing PDF Documents


Use SQLite to store the extracted content and also learn how to do full-text search using SQLite.

Search with SQLite


Text can be converted to a numeric representation called a Vector Embedding. Once you have vectors, you can find other similar vectors using Nearest Neighbor algorithms. Learn how to extract Vector Embeddings from text and find similar vectors using Nearest Neighbors in scikit-learn.

Search with Nearest Neighbors


Nearest Neighbor algorithms have variants that allow them to scale across large numbers of vectors (millions and more). Annoy, a Python library created at Spotify, lets you create a vector index and run approximate Nearest Neighbor algorithms for vector search.

Approximate Nearest Neighbors


FastAPI is a popular Python library for creating web services. These web services can be deployed on the cloud, and users can call your web service to run a Semantic Search query.

FastAPI Service


Don't worry about creating an environment on the cloud for your code to run. Create a Docker image with all the dependencies and run your image as a container anywhere. This has been a game-changer for developers who want to experiment easily (Docker images are available for almost all technologies and run anywhere, even on your Windows laptop) and deploy even more easily.

Deploy with Docker


You may have heard of RAG solutions (if not, Google it now) and popular libraries like langchain or llama_index. Before you start using these libraries, learn how they work by building a RAG solution from scratch. In the previous chapters you have developed 80% of the necessary components; now you will call a Large Language Model from OpenAI, GPT-3.5, for the final step.

Retrieval Augmented Generation

Supplements

Kaggle Notebook with all the code

Processing PDFs for NLP

One of the challenges in the field of Natural Language Processing (NLP) is efficiently extracting textual information from various document types, such as PDFs. PDFs, while convenient for viewing and printing, are not always straightforward to extract information from due to the way they store and render textual data.

PyMuPDF (also known as `fitz`) stands out for its efficiency and accuracy in text extraction from PDFs.

Install the library with pip install PyMuPDF.

  • Initialization: First, you list all the PDF files in your designated directory using the os.listdir method. You can choose the PDF documents available at https://www.kaggle.com/datasets/harshsinghal/nlp-and-llm-related-arxiv-papers as your input document collection.

  • Opening PDFs: For each PDF, you open it with fitz.open(). This gives you an object that represents the entire document.

  • Page-by-page extraction: PDFs are typically composed of multiple pages. You can loop through each page and use the .get_text() method to obtain the textual data.


import os
import fitz

PDF_DIR = "./pdfdocuments/"
all_pdf_files = os.listdir(PDF_DIR)

for each_file in all_pdf_files:
    try:
        print("Parsing ... ", each_file)
        pdf_doc = fitz.open(PDF_DIR + each_file)
        num_pages = pdf_doc.page_count
        print("Total number of pages: ", num_pages)

        # Extract the text of each page in turn
        for page_num in range(num_pages):
            try:
                current_page = pdf_doc.load_page(page_num).get_text()
                print(f"Text from page {page_num}:\n", current_page)
            except Exception as e:
                print(f"Exception occurred while reading page {page_num}: {e}")
    except Exception as e:
        print(f"Exception occurred while parsing {each_file}: {e}")

Chunking the Text

Now that you have the raw textual data from the PDF, the next step is processing. Often in NLP, it's beneficial to break down larger documents into smaller, more manageable pieces or "chunks".

Why chunking?:

  • Makes data more digestible: Smaller chunks are easier to analyze than one large body of text.

  • Better context understanding: It ensures that each piece of text retains a certain level of context, which is important for many NLP tasks.

  • Utilizing SpacyTextSplitter: The provided SpacyTextSplitter is a handy tool that leverages the power of the SpaCy library. By setting a chunk_size, you specify approximately how large (in characters) each chunk should be.

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)

for each_file in all_pdf_files:
    pdf_doc = fitz.open(PDF_DIR + each_file)
    for page_num in range(pdf_doc.page_count):
        current_page = pdf_doc.load_page(page_num).get_text()
        chunks = text_splitter.split_text(str(current_page))
        for chunk in chunks:
            print(chunk)

Check out the other text splitters available in langchain at https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.text_splitter
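
One commonly used alternative is the RecursiveCharacterTextSplitter, which tries to split on paragraph and sentence boundaries before falling back to individual characters. Here is a minimal sketch; the import path shown matches the langchain version used above, and newer releases may move the splitters into a separate langchain_text_splitters package:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split into roughly 400-character chunks with a 50-character overlap between consecutive chunks
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

sample_text = "Large language models are trained on vast corpora of text. " * 20
chunks = recursive_splitter.split_text(sample_text)
print(len(chunks), "chunks produced")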

Storing Data in SQLite Database

After processing the text, you'll need a way to store it for later analysis. Databases provide an organized method for storing, retrieving, and managing data.

Why SQLite:

  • Lightweight: SQLite is a serverless, self-contained database engine. It's light enough to be included in mobile apps, desktop software, and large-scale applications.

  • No setup required: Unlike other databases, it doesn’t require a separate server or setup. Data is stored in a single file.

  • Data Storage Procedure:

    • Setting up the database: Before storing data, you need to set up the database structure. This involves creating tables and defining the types of data each column will hold.
    • Inserting data: For each chunk of text, you construct a dictionary containing the chunk's details and use an INSERT INTO SQL statement to add this data to the database.
    • Committing transactions: Databases operate using a transaction model. After inserting the data, you need to commit the transaction, confirming that you want the changes saved.
import os
import fitz
import sqlite3
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)

# The input path should be modified to match your requirements.
PDF_DIR = "/kaggle/input/nlp-and-llm-related-arxiv-papers/"
all_pdf_files = os.listdir(PDF_DIR)

# Let's take a sample of PDFs
all_pdf_files_sample = all_pdf_files[:20]


# Set up the SQLite database
con = sqlite3.connect("chunks.db")
cur = con.cursor()
cur.execute('''
CREATE TABLE IF NOT EXISTS pdf_chunks (
    chunk_id INTEGER PRIMARY KEY,
    chunk_text TEXT,
    page_number INTEGER,
    document_file_name TEXT
)
''')

START_CTR = 100

for each_file in all_pdf_files_sample:
    print("processing.. ", each_file)
    pdf_doc = fitz.open(PDF_DIR + each_file)
    for page_num in range(pdf_doc.page_count):
        current_page = pdf_doc.load_page(page_num).get_text()
        chunks = text_splitter.split_text(str(current_page))        
        for each_chunk in chunks:
            temp_dict = {
                'chunk_id': START_CTR,
                'chunk_text': each_chunk,
                'page_number': page_num,
                'document_file_name': each_file
            }
            
            cur.execute('''
            INSERT INTO pdf_chunks (chunk_id, chunk_text, page_number, document_file_name)
            VALUES (:chunk_id, :chunk_text, :page_number, :document_file_name);
            ''', temp_dict)
            
            con.commit()
            START_CTR += 1

con.close()

SQLite: The Unsung Hero of Modern Technology

Imagine a world where nearly every device you interact with, from your smartphone to your TV, secretly shares a common thread. Enter SQLite, the world's most widespread database engine. With its unparalleled reach, SQLite is the unsung hero embedded in the heart of countless technological tools and toys.

  • Every smartphone in your pocket, be it Android or iPhone.
  • The Mac you're typing on or the Windows 10 machine you game with.
  • Web browsers like Firefox, Chrome, and Safari that help you explore the vast universe of the internet.
  • Applications like Skype keeping you connected, iTunes serenading your evenings, and Dropbox guarding your precious files.
  • TurboTax and QuickBooks ensuring your finances are in tip-top shape.
  • Even your TV set and the multimedia system in your car.

But that's not all! Developers adore SQLite too. With seamless integrations in popular programming languages like PHP and Python, it's a darling among coders.

Given that over 4 billion smartphones are currently buzzing around, each packed with hundreds of SQLite databases, it's quite plausible we're living in a world with over a trillion SQLite instances working silently in the background. Talk about an unsung technological marvel!

In summary, this chapter has given you a holistic view of the journey of textual data: extracting it from a complex format (like PDF), processing it for NLP tasks (like chunking), and finally storing the processed data securely and systematically for future analysis.

Introduction to Full-Text Search in SQLite using FTS5

In the world of databases, the ability to perform lightning-quick searches on large bodies of text is invaluable.

SQLite, despite being a lightweight, serverless database engine, packs a powerful punch in this domain with its Full-Text Search (FTS) extension. FTS5 is the latest version of this extension, and in this chapter we'll explore how to set it up and use it to search across the chunks we previously extracted from the PDF documents.

What is Full-Text Search (FTS)?

In simple terms, Full-Text Search allows you to perform complex search operations on large text columns. Instead of examining each record one by one, FTS creates a virtual table with a structure optimized for searching, making the process incredibly fast.

Setting up an FTS5 Virtual Table in SQLite

Before you can use FTS5, you need to set up a virtual table. Here's how you can do it:


CREATE VIRTUAL TABLE pdf_chunks_srch USING fts5(
    chunk_text, 
    page_number UNINDEXED,
    document_file_name UNINDEXED,
    content='pdf_chunks', 
    content_rowid='chunk_id' 
);

Breaking Down the Query:

CREATE VIRTUAL TABLE pdf_chunks_srch USING fts5: This command sets up a virtual table named pdf_chunks_srch using the FTS5 module.

The columns inside the parentheses are the ones you want to include in the virtual table. chunk_text is the main text column we'll be searching on.

page_number UNINDEXED and document_file_name UNINDEXED: Marking columns as UNINDEXED tells SQLite not to tokenize and index them. They can still be returned in query results, but they cannot be matched against, and leaving them out keeps the FTS index smaller.

content='pdf_chunks': This indicates the real table that the virtual table should pull data from.

content_rowid='chunk_id': This specifies the primary key of the real table.

Populating the Virtual Table

Once the virtual table is set up, we need to populate it with data from our main table:

INSERT INTO pdf_chunks_srch (rowid, chunk_text, page_number, document_file_name)
SELECT chunk_id, chunk_text, page_number,document_file_name from pdf_chunks;

This command pulls data from our main pdf_chunks table and inserts it into our pdf_chunks_srch virtual table.

Searching Using FTS5

With our virtual table populated, we can now perform full-text searches. Let's say you want to find chunks containing the word "prompting":


SELECT chunk_text, page_number, document_file_name 
FROM pdf_chunks_srch 
WHERE pdf_chunks_srch MATCH 'prompting';

This will quickly return all the relevant chunks containing the word "prompting", along with their page numbers and document filenames.
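
If you prefer to run these statements from Python rather than the sqlite3 shell, a minimal sketch using the chunks.db file and table names from the previous chapter looks like this (run the population step only once, or the index will contain duplicate entries):

import sqlite3

con = sqlite3.connect("chunks.db")
cur = con.cursor()

# Create the FTS5 virtual table backed by the existing pdf_chunks table
cur.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS pdf_chunks_srch USING fts5(
    chunk_text,
    page_number UNINDEXED,
    document_file_name UNINDEXED,
    content='pdf_chunks',
    content_rowid='chunk_id'
);
""")

# Populate the virtual table from the main table (run this once)
cur.execute("""
INSERT INTO pdf_chunks_srch (rowid, chunk_text, page_number, document_file_name)
SELECT chunk_id, chunk_text, page_number, document_file_name FROM pdf_chunks;
""")
con.commit()

# Run a full-text search for the word "prompting"
cur.execute(
    "SELECT chunk_text, page_number, document_file_name "
    "FROM pdf_chunks_srch WHERE pdf_chunks_srch MATCH ?",
    ("prompting",),
)
for chunk_text, page_number, document_file_name in cur.fetchall():
    print(document_file_name, page_number)

con.close()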

More advanced topics on full-text search in SQLite are covered in the official FTS5 documentation at https://www.sqlite.org/fts5.html

Conclusion

Full-Text Search, especially with the FTS5 module in SQLite, provides a robust and efficient way to search through large text datasets. It's especially useful in scenarios like ours, where we've chunked and stored significant amounts of text from PDFs. With the combination of SQLite and FTS5, searching through this data becomes a breeze.

Extracting Text Embeddings and Nearest Neighbors Search

Modern Natural Language Processing (NLP) techniques often require transforming textual data into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the text. Once in this form, various machine learning algorithms can process the data, and one of the most common tasks is finding texts that are semantically similar.

Many good articles on embeddings are available online and are worth reading alongside this chapter.

In this chapter, we'll explore how to use the sentence_transformers library to generate embeddings for chunks of text. After that, we'll delve into how you can use scikit-learn's Nearest Neighbors module to find similar text chunks based on their embeddings.

Extracting Embeddings using Sentence Transformers

The `sentence_transformers` library provides an easy and efficient way to convert sentences into embeddings.

Setting up:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Here, we've initialized a Sentence Transformer model called all-MiniLM-L6-v2. There are various models available, and you can choose the one that best suits your requirements.

Embedding Extraction:

embeddings = model.encode(chunk_text)

The encode method takes a chunk of text and returns its embedding. This embedding can be saved for future use.
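
As a quick sanity check, you can encode a couple of chunks and inspect the shape of the result. The 384-dimensional output noted in the comment is specific to all-MiniLM-L6-v2 and will differ for other models:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encoding a list of texts returns one vector per text
sample_chunks = [
    "Vector embeddings capture semantic similarity between texts.",
    "Full-text search matches exact words rather than meaning.",
]
embeddings = model.encode(sample_chunks)
print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2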

Using Scikit-Learn's Nearest Neighbors for Similarity Search

Before diving into Nearest Neighbors, let's understand the purpose.

Given a chunk of text, you might want to find other text chunks that are semantically similar to it. This is where the Nearest Neighbors algorithm comes into play.

Setting up:

To use the Nearest Neighbors module, first install scikit-learn and import it:

from sklearn.neighbors import NearestNeighbors


# Assuming embeddings_list holds (chunk_id, embedding) tuples
X = [item[1] for item in embeddings_list]

nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
# Here, the model will return the five most similar text chunks when queried.


#To find text chunks similar to a given chunk:


distances, indices = nbrs.kneighbors([model.encode("Your input text here")])
# This will give you the indices of the text chunks that are most similar to the input text.

Putting it all together:


import sqlite3
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# 1. Read the text chunks from the SQLite database

# Connect to the SQLite database
conn = sqlite3.connect('../chunks.db')
cursor = conn.cursor()

# Select all rows from the pdf_chunks table
# We will sample the first 100 rows for this example
cursor.execute("SELECT chunk_id, chunk_text FROM pdf_chunks limit 100")
rows = cursor.fetchall()

conn.close()

# 2. Convert these chunks into embeddings

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings_list = []

for row in rows:
    chunk_id = row[0]
    chunk_text = row[1]
    embeddings = model.encode(str(chunk_text))
    embeddings_list.append((chunk_id, chunk_text, embeddings))

# For the sake of the example, let's save the embeddings to a file (optional)
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_list, f)

# 3. Use scikit-learn's Nearest Neighbors to search for similar chunks

# First build the model
X = [item[2] for item in embeddings_list]
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)

# Function to search for similar text chunks
def find_similar_chunks(input_text):
    distances, indices = nbrs.kneighbors([model.encode(input_text)])
    return [embeddings_list[i] for i in indices[0]]
    

# Test
input_chunk = "prompting techniques"
similar_chunks = find_similar_chunks(input_chunk)
print("Text chunks similar to:", input_chunk)
for chunk in similar_chunks:
    print(chunk[1])

Conclusion

Text embeddings have revolutionized the way we handle and process textual data.

By converting text into numerical form, not only can we better understand the semantic meaning behind words and sentences, but we can also apply machine learning algorithms to derive insights, group similar content, and more.

Approximate Nearest Neighbor Search with Annoy

As we navigate deeper into the realm of machine learning and natural language processing, the need for efficient search mechanisms becomes paramount. Imagine sifting through millions of document embeddings or images to find the ones most similar to a given input. Traditional exact search methods can be computationally intensive and slow.

Enter the world of Approximate Nearest Neighbor (ANN) algorithms, where speed and scalability are at the forefront.

The Need for Approximate Nearest Neighbor Algorithms

While exact nearest neighbor search can guarantee the most similar items, it often does so at the cost of performance, especially with large datasets in high-dimensional spaces. This is where approximate methods shine:

  • Speed: ANN algorithms can find neighbors incredibly fast, making them suitable for real-time applications.
  • Scalability: They can handle vast datasets, as they don't need to scan every item in the dataset.
  • Reduced Computational Overhead: By allowing small compromises on accuracy, they significantly reduce the computation required.

However, with various ANN algorithms available, why should one consider Annoy?

Introduction to Annoy

Annoy, which stands for "Approximate Nearest Neighbors Oh Yeah", is a powerful library that provides a balance between speed and accuracy. But its uniqueness lies in features that cater to modern distributed computing needs.

Salient Features of Annoy:

  • Static Files as Indexes: Annoy's ability to use static files for indexes means you can share an index across different processes. This decoupling of index creation and loading provides a level of flexibility rarely seen.

  • Minimal Memory Footprint: In scenarios with millions or even billions of vectors, memory usage becomes critical. Annoy ensures the indexes are compact.

  • Built for Production: The architecture of Annoy is such that you can easily pass around and distribute static index files, making it apt for production environments, Hadoop jobs, and more.

  • Real-world Usage: An endorsement of its capabilities, Annoy is used by Spotify for music recommendations. By representing every user/item as a vector post-matrix factorization, Annoy aids in efficiently searching for similar users or items, even when dealing with millions of tracks in high-dimensional space.

import sqlite3
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer

# Load the SentenceTransformer model for encoding text chunks
model = SentenceTransformer("all-MiniLM-L6-v2")

# Dimensionality of the vector embeddings (384 for all-MiniLM-L6-v2; adjust if you use a different model)
VEC_INDEX_DIM = 384

# Initialize the Annoy index for storing and searching vectors
vec_index = AnnoyIndex(VEC_INDEX_DIM, 'angular')

# Connect to the SQLite database
conn = sqlite3.connect('chunks.db')
cursor = conn.cursor()


# Fetch a sample of rows (the first 100 chunks) from the database
cursor.execute("SELECT chunk_id, chunk_text FROM pdf_chunks LIMIT 100")
rows = cursor.fetchall()

# For each row in the batch, extract the text chunk, encode it, and add it to the Annoy index
for row in rows:
    chunk_id = row[0]
    chunk_text = row[1]
    embeddings = model.encode(str(chunk_text))
    vec_index.add_item(chunk_id, embeddings)          

# Close the SQLite database connection
conn.close()

# Build the Annoy index with 100 trees (more trees give higher accuracy at the cost of a larger index and longer build time)
vec_index.build(100)

# Save the built index to a file for future use
vec_index.save("./vecindex.ann")

Querying an Annoy Vector Index with Text Embeddings

Vector search is a powerful way to find items with similar characteristics. In the realm of text processing, these characteristics are usually captured as embeddings – high-dimensional vectors that encapsulate semantic information. By querying a vector index like Annoy with a text's embedding, you can quickly find texts that are semantically similar.

Here's how to harness this capability:

from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer


# Load the Sentence Transformer Model
# We'll use this model to convert our query text into a vector embedding.


model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/kaggle/working/vecindex.ann")

# Generate the Embedding for the Query Text
# Let's convert the query text into an embedding using the Sentence Transformer model


text = "zero-shot prompting"
embedding = model.encode([text])
input_vec = embedding[0]


# Retrieve the IDs of the top 10 most similar text chunks based on the query text's embedding:

chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)
print(chunk_ids)

# Retrieve the Actual Text Chunks from the SQLite Database
# First, establish a connection to the SQLite database.


con = sqlite3.connect("/kaggle/working/chunks.db")
cur = con.cursor()

# Then, retrieve and print the matching text chunks:


list_chunk_ids = ','.join([str(k) for k in chunk_ids])
cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
res = cur.fetchall()

for i in res:
    print(i[1])
    print("----------")

By following this process, you can easily search for and retrieve texts semantically similar to a given input, even from a vast dataset, in a matter of milliseconds!

Conclusion

The age of big data necessitates tools and techniques that can handle the volume, velocity, and variety of information.

While traditional search algorithms have their place, when it comes to large datasets in high-dimensional spaces, approximate methods like Annoy emerge as the frontrunners.

Whether you're recommending songs, finding similar images, or sifting through document embeddings, Annoy offers a scalable and efficient solution, ensuring your applications remain swift and responsive.

FastAPI: The Modern-day Framework for AI Applications in Python

In the dynamic and ever-evolving landscape of technology, the ability to develop and deploy AI models with speed, accuracy, and scalability is paramount. The fusion of AI with modern web technologies allows developers to build powerful applications that are not just intelligent, but also highly interactive and accessible. Among the multitude of frameworks and tools available to Python developers today, FastAPI stands out as a front-runner. Here's a brief introduction to FastAPI and why it is becoming an indispensable skill for the contemporary AI engineer.

What is FastAPI?

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. The key features are:

  • Fast: The name 'FastAPI' isn't just a catchy moniker. One of its biggest draws is its performance. FastAPI is often reported to be on par with NodeJS and Go in terms of speed, and significantly faster than traditional Python frameworks.

  • Pythonic: FastAPI is designed with the Pythonista in mind. Its syntax and utilization of Python’s type hints make the code clean, intuitive, and easy to read.

  • Automatic Interactive API Documentation: With FastAPI, once your API is built, it automatically provides an interactive documentation (using tools like Swagger UI and ReDoc) allowing users and developers to understand and test your API's endpoints seamlessly.

FastAPI and AI: A Power Pairing

AI engineers work diligently to design, train, and fine-tune models that can then be served to end-users. But creating a model is just half the battle; presenting it in a usable format through an API is equally important. FastAPI simplifies this process. Here’s why it's particularly apt for AI:

  • Seamless Integration with ML Libraries: FastAPI can easily work alongside popular ML libraries like TensorFlow, PyTorch, and Scikit-learn, ensuring that the transition from model development to deployment is smooth.

  • Asynchronous Capabilities: Many AI applications, especially those involving deep learning, can be compute-intensive. FastAPI's support for asynchronous request handling means that even while a model is processing a request, the API remains responsive.

  • Data Validation: FastAPI's use of Python type hints doesn't just make for cleaner code; it provides automatic request validation. This is particularly useful for AI apps where the format and type of input data are critical for model inference.
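
To make the type-hint-driven validation concrete, here is a minimal, self-contained sketch; the endpoint and parameter names are illustrative and not part of the service built later in this chapter. Declaring the query parameter as an int is enough for FastAPI to parse it, validate it, and reject bad input with a descriptive error, with no extra code:

from fastapi import FastAPI

app = FastAPI()

@app.get("/square/")
async def square(number: int):
    # FastAPI converts and validates `number` from the query string using the type hint.
    # A request like /square/?number=abc is rejected automatically with a 422 response.
    return {"number": number, "squared": number * number}

# Run with: uvicorn demo:app --reload   (assuming this file is saved as demo.py)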

Why Every AI Engineer Should Consider Learning FastAPI

In the age of AI-driven applications, the bridge between robust machine learning models and end-user accessibility is a well-designed API. FastAPI facilitates this bridge. It allows AI engineers to focus on what they do best - designing and refining models - while ensuring that the deployment and scaling process is streamlined, efficient, and hassle-free.

The modern AI engineer isn't just a model builder; they are a solution provider. And in the world of solutions, FastAPI is becoming an increasingly essential tool in the AI engineer’s arsenal.

As you delve deeper into the chapters of this book, you'll gain insights into how FastAPI can be harnessed to bring your AI applications to life, making them accessible to the world, one endpoint at a time.

To create a FastAPI service that serves the purpose, we'll follow these steps:

  • Set up FastAPI.
  • Define the route (or endpoint) that accepts the text input via GET request.
  • Integrate the provided code into the defined route.
  • Return the result as JSON.

Below is the FastAPI service:

from fastapi import FastAPI
from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer
from typing import List, Dict

app = FastAPI()

# Load the Sentence Transformer Model
model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

# Load the Annoy index
u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/kaggle/working/vecindex.ann")

# SQLite connection
con = sqlite3.connect("/kaggle/working/chunks.db")
cur = con.cursor()

@app.get("/find_similar_text/", response_model=List[Dict[str, str]])
async def read_similar_text(query_text: str):
    """
    Given a query_text, find the top 10 text chunks from the database that are semantically similar.
    """

    # Convert the query text into an embedding
    embedding = model.encode([query_text])
    input_vec = embedding[0]

    # Retrieve the IDs of the top 10 most similar text chunks
    chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)
    
    # Fetch the actual text chunks from the SQLite database
    list_chunk_ids = ','.join([str(k) for k in chunk_ids])
    cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
    res = cur.fetchall()
    
    # Construct the result list
    result = [{"chunk_id": str(chunk[0]), "chunk_text": chunk[1]} for chunk in res]
    return result


# You would then run this API using a tool like Uvicorn and send GET requests to the defined endpoint.

To test this FastAPI service:

Run the FastAPI app using Uvicorn, for example: uvicorn main:app --host 0.0.0.0 --port 8000 (assuming your service code is saved in main.py).

Use the endpoint /find_similar_text/?query_text=YOUR_TEXT_HERE to query for similar texts.

This FastAPI service provides a convenient and efficient way to query for similar texts, making it highly useful in various NLP applications.

If you are developing in Kaggle, you can install ngrok and get a public URL where your FastAPI service will be deployed.

ngrok puts localhost on the internet.

!pip install fastapi nest-asyncio pyngrok uvicorn


import nest_asyncio
from pyngrok import ngrok
import uvicorn


# specify a port
port = 8000
ngrok_tunnel = ngrok.connect(port)

# where we can visit our fastAPI app
print('Public URL:', ngrok_tunnel.public_url)


nest_asyncio.apply()

# finally run the app
uvicorn.run(app, port=port)

For the public URL to work, you have to create an account on the ngrok website and obtain an auth token. Configure the token in a notebook cell as shown below, and then run the code snippet above again.

!ngrok config add-authtoken <<auth_token>>

Replace <<auth_token>> with the token you receive on the ngrok website.

To test the API, use Postman or any other HTTP client.

Docker: Containerizing the Future of AI Engineering

In the world of software development and deployment, there exists a notorious phrase: "It works on my machine."

For many years, this statement has epitomized the challenges developers face when their applications run seamlessly in one environment, but encounter issues in another.

Enter Docker: a solution that has revolutionized how we think about consistency, scalability, and deployment in software development. For AI engineers, the adoption of Docker can be a game-changer. In this chapter, we will explore why every modern-day AI engineer should make Docker a key tool in their repertoire.

What is Docker?

Docker is a platform designed to develop, ship, and run applications inside containers. A container encapsulates an application along with all its dependencies, libraries, and binaries in one package. This ensures that the application will run identically regardless of where the container is deployed.

AI Engineering and the Need for Consistency

AI models, by nature, are complex entities that rely on a specific stack of libraries, dependencies, and environmental variables. A slight change in any of these elements could lead to variations in the model's performance or, worse, complete failures. Docker's containerization ensures that once a model is trained, it can be wrapped with its entire environment, ensuring consistent results from development to production.

Scalability and Portability in the Age of AI

Modern AI applications are not limited to powerful servers or data centers. They are deployed on the cloud, edge devices, and even on IoT devices. Docker containers can be effortlessly moved across these environments, ensuring that AI applications are scalable and portable.

Streamlining the AI Workflow

Training AI models is just one part of an AI engineer's journey. Model serving, versioning, and continuous integration and deployment are integral aspects of bringing AI to the real world. Docker simplifies these processes by providing a unified framework where models can be easily versioned, shared, and deployed without the overhead of traditional setup and configuration.

Embracing Docker isn't merely about adopting a new technology; it's about embracing a paradigm that ensures the fruits of AI engineering can be reliably and consistently enjoyed by end-users.

As we delve deeper into this chapter, we will unpack the technicalities, best practices, and transformative potential of Docker in the world of AI.

Below is a Dockerfile to set up and run the FastAPI service:

Dockerfile

# Use an official Python runtime as the parent image
FROM python:3.8-slim

# Install build utilities for Annoy
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory in the container
WORKDIR /usr/src/app

# Copy the local code to the container
COPY . .

# Install the needed Python packages
RUN pip install --no-cache-dir fastapi[all] uvicorn sentence_transformers annoy

# Make port 80 available to the world outside this container
EXPOSE 80

# Define environment variable for Uvicorn
ENV UVICORN_HOST 0.0.0.0
ENV UVICORN_PORT 80

# Run the FastAPI application using Uvicorn when the container launches
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

Here's a quick guide to build and run the Docker container:

First create a directory locally.

  • Save the above content in a file named Dockerfile in the root directory of your FastAPI application.
  • In the same directory, save your FastAPI code in a file named main.py
  • Download the vecindex.ann and chunks.db files from your Kaggle notebook into the same directory where main.py and Dockerfile are stored, and update the file paths in main.py (for example, change /kaggle/working/vecindex.ann to ./vecindex.ann and /kaggle/working/chunks.db to ./chunks.db) so that they resolve inside the container's working directory.
  • Navigate to the root directory (where the Dockerfile is located) in your terminal or command prompt.
  • Build the Docker image by issuing the below command in the Terminal: docker build -t fastapi-service .
  • After the build completes, run the Docker container: docker run -p 80:80 fastapi-service

You can then access the FastAPI service in your browser or using tools like curl at http://localhost:80.

Remember, Docker provides a self-contained environment, so this setup ensures that all dependencies, including the necessary build tools for Annoy, are encapsulated within the Docker image, allowing for easy deployment and scaling.

The code snippet below will allow you to test your service.


import requests

# Define the endpoint URL
url = "http://localhost:80/find_similar_text/"

# Define the query parameters
params = {
    "query_text": "few-shot prompting"
}

# Make the GET request
response = requests.get(url, params=params)

# Check if the request was successful
if response.status_code == 200:
    similar_texts = response.json()
    for i, text in enumerate(similar_texts, 1):
        print(f"{i}. {text['chunk_text']}\n")
else:
    print(f"Error {response.status_code}: {response.text}")

Retrieval-Augmented Generation using Language Models

The previous chapters equipped you with the skills to extract, chunk, and embed information from diverse PDF sources.

The essence of AI-driven solutions lies not just in the retrieval of data but, more importantly, in its effective interpretation. This is where Large Language Models (LLMs), like OpenAI's GPT variants, come into play.

The Power of Retrieval-Augmented Generation (RAG)

Before diving into the code, it's crucial to grasp the paradigm of Retrieval-Augmented Generation (RAG). RAG seamlessly combines the strengths of two worlds:

  • Retrieval Systems: These systems, such as the Annoy index we built earlier, efficiently retrieve chunks of relevant data based on vector similarity.

  • Generative Language Models (LLM): These are sophisticated models that can generate human-like text based on given prompts, such as GPT-3.5.

When united, retrieval systems pull the most relevant data, and generative LLMs interpret this data, providing insightful responses or interpretations. This combo is potent for a myriad of applications like question answering, document summarization, and more.

Leveraging LLM for RAG: A Deep Dive

The provided code walks you through how to combine the power of retrieval systems with LLMs for a comprehensive solution:

1. Setting up Dependencies

from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer
import openai
from string import Template

openai.api_key = "YOUR_API_KEY"

Here, we're importing all the necessary libraries. Notice AnnoyIndex for retrieval, SentenceTransformer for embedding, and openai for leveraging the LLM.

2. Loading the Sentence Transformer Model


model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

The sentence transformer model is loaded to generate embeddings for any new queries we might want to make.

3. Querying the Vector Index


u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/path_to_vector_index/vecindex.ann")

query_text = "your_query_here"
embedding = model.encode([query_text])
input_vec = embedding[0]

chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)

Here, we load the pre-constructed Annoy index and fetch the top 10 similar text chunks to our query_text based on vector similarity.

4. Fetching Relevant Text Chunks


con = sqlite3.connect("/path_to_database/chunks.db")
cur = con.cursor()

list_chunk_ids = ','.join([str(k) for k in chunk_ids])
cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
res = cur.fetchall()
res_docs = '\n'.join([k[1] for k in res])

This code establishes a connection to the SQLite database, where the chunks are stored, and retrieves the actual text based on their IDs.

5. Engaging LLM for Interpretation

DEFAULT_TEXT_QA_PROMPT_TMPL = Template(
    "Context information is below.\n"
    "---------------------\n"
    "$context_str\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: $query_str\n"
    "Answer: "
)

send_to_llm = DEFAULT_TEXT_QA_PROMPT_TMPL.substitute(context_str=res_docs, query_str=query_text)

completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": send_to_llm}])
print(completion.choices[0].message.content)

We leverage a template-based approach to send a structured prompt to GPT-3.5. It takes the retrieved text chunks as context and then seeks an answer based on that context to the posed query.

The full code is provided below.


# Calling the LLM with the retrieved chunks

# Now query the Vector Index created earlier with Annoy and fetch the chunks from SQLite
from annoy import AnnoyIndex
import sqlite3
from sentence_transformers import SentenceTransformer
import openai
from string import Template

openai.api_key = "sk-..."


# Load the Sentence Transformer Model
# We'll use this model to convert our query text into a vector embedding.

model = SentenceTransformer('all-MiniLM-L6-v2')
VEC_INDEX_DIM = 384

u = AnnoyIndex(VEC_INDEX_DIM, 'angular')
u.load("/kaggle/working/vecindex.ann")

# Generate the Embedding for the Query Text
# Let's convert the query text into an embedding using the Sentence Transformer model


query_text = "what is zero-shot prompting"
embedding = model.encode([query_text])
input_vec = embedding[0]


# Retrieve the IDs of the top 10 most similar text chunks based on the query text's embedding:

chunk_ids = u.get_nns_by_vector(input_vec, 10, search_k=-1, include_distances=False)
print(chunk_ids)

# Retrieve the Actual Text Chunks from the SQLite Database
# First, establish a connection to the SQLite database.


con = sqlite3.connect("/kaggle/working/chunks.db")
cur = con.cursor()

# Then, retrieve the matching text chunks:


list_chunk_ids = ','.join([str(k) for k in chunk_ids])
cur.execute("select chunk_id, chunk_text from pdf_chunks where chunk_id in (" + list_chunk_ids + ")")
res = cur.fetchall()
res_docs = '\n'.join([k[1] for k in res])

# Construct a template that will get filled with the results fetched and then sent to OpenAI 
# Template used from https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/default_prompts.py
DEFAULT_TEXT_QA_PROMPT_TMPL = Template(
    "Context information is below.\n"
    "---------------------\n"
    "$context_str\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: $query_str\n"
    "Answer: "
)

send_to_llm = DEFAULT_TEXT_QA_PROMPT_TMPL.substitute(context_str=res_docs, query_str=query_text)
        
completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": send_to_llm}])
print(completion.choices[0].message.content)    
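
The code above uses the legacy openai Python client (versions before 1.0). If you have a newer release of the openai package installed, the chat call looks slightly different; a minimal sketch assuming openai>=1.0 and the send_to_llm prompt constructed above:

from openai import OpenAI

# The client can also read OPENAI_API_KEY from the environment instead of an explicit key
client = OpenAI(api_key="sk-...")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": send_to_llm}],
)
print(completion.choices[0].message.content)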


Conclusion

The synergy between efficient retrieval systems and advanced LLMs makes Retrieval-Augmented Generation a compelling strategy for AI application development. By understanding and combining the steps detailed in this chapter and previous ones, engineering students can now create end-to-end solutions for modern AI applications, driving value in real-world scenarios.

About Harsh Singhal

Harsh Singhal side profile picture

Harsh Singhal is a global Data Science and Machine Learning leader.

With over 15 years of experience, Harsh is an industry-recognized leader in Machine Learning, Data Science, and Artificial Intelligence. Harsh's career journey spans various verticals across global markets and top tech companies like LinkedIn, Netflix, and Palo Alto Networks, with a focus on delivering data-driven solutions and driving business outcomes.

Harsh has a proven record of building and scaling high-performance teams - a testament to this is his success at Koo, where Harsh spearheaded the expansion of ML/Data Science teams.

As a result-driven professional, Harsh has applied ML/AI techniques to a wide array of business problems, from bot detection and spam detection to sales product recommendation and account takeover prevention during his time at LinkedIn and Netflix.

This broad spectrum of experience shows his flexibility and adaptability in handling complex business challenges.

Harsh's innovative approach is underpinned by patents in key areas like bot detection and threat detection. Harsh also has publications in renowned forums like the IEEE Systems, Man, and Cybernetics Society.

Harsh maintains an online publication datascience.fm that attracts thousands of readers every month and has seen contributions from student leaders and industry professionals.

Harsh has developed Molecule Search to provide molecule-based patent search to medicinal chemistry enthusiasts. This product applies vector similarity search using RDKit Postgres extension. The product also includes ChatGPT based patent summary and applicant patent landscape analysis.

Molecule Search is a great example of a data product.

Harsh started his career with Mu Sigma in 2008, where he developed the first-ever industry curriculum for R. At Mu Sigma, Harsh was a founding member of the Innovations and Development team, where he worked on automating Machine Learning models, a paradigm that was later termed AutoML. Harsh relocated to the Bay Area in 2011 after joining LinkedIn's Bangalore office and being hired as the first Data Scientist in LinkedIn India.

Between 2011 and 2020 Harsh lived in the Bay Area, California, where he developed impactful Data Science & Machine Learning (DSML) solutions at companies such as LinkedIn and Netflix.

After having spent a decade in California, Harsh decided to move to India in early 2021.

Harsh joined Koo in late 2021 to build their ML/AI team. Harsh quickly scaled the ML team from 3 to 20 engineers, comprising folks in Data Science, Machine Learning, and ML Ops.

Koo has been downloaded by more than 50 million users across the world.

Under Harsh's leadership the team successfully delivered ML-powered product features such as ChatGPT-assisted writing tools for creators, Semantic Search, Multilingual Topics, People You May Know, Content Recommendation, Feed Ranking, and Trending Topics. The team also led all aspects of Content Moderation and Spam detection and was responsible for developing all personalization features across the app.

A key highlight of Harsh's contributions at Koo was to deliver an industry-first multilingual Topics feature. This feature allowed every user irrespective of their native language to find content based on their Topic of interest and increased the average retention and time spent amongst millions of Koo users. Topics was developed on the back of many innovations in the field such as the use of multilingual embeddings, open-source model training and deployment technologies such as Ludwig and BentoML.

Koo collaborated closely with AI4Bharat to develop KooBERT, the first open-source BERT model trained on multilingual microblog content.

Personalization features such as Recommended For You enabled creators to be discovered more efficiently by users who used Koo to connect with their favorite content and creators. These features were developed using large-scale data engineering technologies such as Spark and recommender system algorithms such as ALS.

The Content Moderation technologies developed by the ML team led by Harsh provided a safe environment for users. The team adopted cutting-edge infrastructure elements such as vector databases like Milvus and fine-tuned Large Language Models (LLMs) such as Llama 2 and mT5 to detect toxicity.

Harsh has been invited to panel discussions on Indian language technology covered by the Government of India's think tank on AI, and is a strong supporter of developing technologies out of India.

Harsh has a YouTube channel where he posts videos on a variety of topics of interest to professionals in the Data and AI ecosystem.

Harsh Singhal actively works with student communities and guides them to excel in their journey towards DSML excellence. Harsh is also involved as an advisor in developing DSML curricula at academic institutions to increase AI talent density amongst India's student community.

Harsh Singhal delivering a seminar at RIT, Bangalore
Harsh Singhal delivering a seminar at Down Town University, Guwahati