Introduction to Full-Text Search in SQLite using FTS5

In the world of databases, the ability to perform lightning-quick searches on large bodies of text is invaluable.

SQLite, despite being a lightweight, serverless database engine, packs a powerful punch in this domain with its Full-Text Search (FTS) extension. The FTS5 is the latest version of this extension, and in this chapter, we'll explore how to set it up and use it to search across the chunks from the PDF documents that we've previously extracted.

What is Full-Text Search (FTS)?

In simple terms, Full-Text Search allows you to perform complex search operations on large text columns. Instead of examining each record one by one, FTS creates a virtual table with a structure optimized for searching, making the process incredibly fast.

Setting up an FTS5 Virtual Table in SQLite

Before you can use FTS5, you need to set up a virtual table. Here's how you can do it:


CREATE VIRTUAL TABLE pdf_chunks_srch USING fts5(
    chunk_text, 
    page_number UNINDEXED,
    document_file_name UNINDEXED,
    content='pdf_chunks', 
    content_rowid='chunk_id' 
);

Breaking Down the Query:

CREATE VIRTUAL TABLE pdf_chunks_srch USING fts5: This command sets up a virtual table named pdf_chunks_srch using the FTS5 module.

The columns inside the parentheses are the ones you want to include in the virtual table. chunk_text is the main text column we'll be searching on.

page_number UNINDEXED and document_file_name UNINDEXED: By marking columns as UNINDEXED, we're telling SQLite not to include these columns in the FTS index, which can make searches faster.

content='pdf_chunks': This indicates the real table that the virtual table should pull data from.

content_rowid='chunk_id': This specifies the primary key of the real table.

Populating the Virtual Table

Once the virtual table is set up, we need to populate it with data from our main table:

INSERT INTO pdf_chunks_srch (rowid, chunk_text, page_number, document_file_name)
SELECT chunk_id, chunk_text, page_number,document_file_name from pdf_chunks;

This command pulls data from our main pdf_chunks table and inserts it into our pdf_chunks_srch virtual table.

Searching Using FTS5

With our virtual table populated, we can now perform full-text searches. Let's say you want to find chunks containing the word "prompting":


SELECT chunk_text, page_number, document_file_name 
FROM pdf_chunks_srch 
WHERE pdf_chunks_srch MATCH 'prompting';

This will quickly return all the relevant chunks containing the word "contract", along with their page numbers and document filenames.

More advanced topics on full-text search in SQLite can be found here

Conclusion

Full-Text Search, especially with the FTS5 module in SQLite, provides a robust and efficient way to search through large text datasets. It's especially useful in scenarios like ours, where we've chunked and stored significant amounts of text from PDFs. With the combination of SQLite and FTS5, searching through this data becomes a breeze.