Processing PDFs for NLP

One of the challenges in Natural Language Processing (NLP) is efficiently extracting text from document formats such as PDF. PDFs are convenient for viewing and printing, but extracting text from them is not always straightforward: the format describes where glyphs are drawn on a page rather than the logical flow of paragraphs, so reading order and word boundaries often have to be reconstructed.

PyMuPDF (imported in Python as `fitz`) stands out for its efficiency and accuracy in text extraction from PDFs.

Install the library with `pip install PyMuPDF`.

  • Initialization: First, you list all the PDF files in your designated directory using the os.listdir method. You can choose the PDF documents available at https://www.kaggle.com/datasets/harshsinghal/nlp-and-llm-related-arxiv-papers as your input document collection.

  • Opening PDFs: For each PDF, you open it with fitz.open(). This gives you an object that represents the entire document.

  • Page-by-page extraction: PDFs are typically composed of multiple pages. You can loop through each page and use the .get_text() method to obtain the textual data.


import os
import fitz  # PyMuPDF

PDF_DIR = "./pdfdocuments/"
all_pdf_files = os.listdir(PDF_DIR)

for each_file in all_pdf_files:
    try:
        print("Parsing ... ", each_file)
        # os.path.join works whether or not PDF_DIR ends with a slash.
        pdf_doc = fitz.open(os.path.join(PDF_DIR, each_file))
        num_pages = pdf_doc.page_count
        print("Total number of pages: ", num_pages)

        for page_num in range(num_pages):
            try:
                current_page = pdf_doc.load_page(page_num).get_text()
                print(f"Text from page {page_num}:\n", current_page)
            except Exception as exc:
                print(f"Exception occurred while reading page {page_num}: {exc}")
        pdf_doc.close()
    except Exception as exc:
        print(f"Exception occurred while parsing {each_file}: {exc}")
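Note that os.listdir returns every entry in the directory, including sub-directories and files that are not PDFs. The following is a minimal sketch of one way to filter the listing before opening anything; the helper name list_pdf_files is hypothetical and not part of PyMuPDF or the original workflow.

```python
import os


def list_pdf_files(directory):
    """Return the sorted full paths of regular files in `directory` ending in .pdf."""
    pdf_paths = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # Skip sub-directories and anything without a .pdf extension.
        if os.path.isfile(full_path) and name.lower().endswith(".pdf"):
            pdf_paths.append(full_path)
    return pdf_paths
```

Each returned path can be passed directly to fitz.open().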

Chunking the Text

Now that you have the raw textual data from the PDF, the next step is processing. Often in NLP, it's beneficial to break down larger documents into smaller, more manageable pieces or "chunks".

Why chunking?

  • Makes data more digestible: Smaller chunks are easier to analyze than one large body of text.

  • Better context understanding: It ensures that each piece of text retains a certain level of context, which is important for many NLP tasks.

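  • Utilizing SpacyTextSplitter: LangChain's SpacyTextSplitter uses the spaCy library to split text into sentences and then groups those sentences into chunks. The chunk_size parameter sets the approximate maximum size of each chunk; by default it is measured in characters, not words. The splitter relies on spaCy and a spaCy language model being installed.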

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)

for each_file in all_pdf_files:
    pdf_doc = fitz.open(PDF_DIR + each_file)
    for page_num in range(pdf_doc.page_count):
        current_page = pdf_doc.load_page(page_num).get_text()
        chunks = text_splitter.split_text(str(current_page))
        for chunk in chunks:
            print(chunk)
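To build intuition for what chunk_size controls, here is a minimal pure-Python sketch of greedy sentence packing. It is not SpacyTextSplitter's actual implementation: the regular expression is a naive stand-in for spaCy's sentence segmentation, and the function name pack_sentences is hypothetical. Sizes are measured in characters with len.

```python
import re


def pack_sentences(text, chunk_size=400):
    """Greedily pack whole sentences into chunks of up to `chunk_size` characters.

    A single sentence longer than `chunk_size` is kept whole as its own chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "PDFs store text oddly. Extraction can be lossy. Chunking keeps pieces small."
for chunk in pack_sentences(sample, chunk_size=50):
    print(len(chunk), chunk)
```

With chunk_size=50 the first two sentences fit together in one chunk, while the third starts a new chunk because adding it would exceed the limit.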

Check out the other text splitters available in langchain at https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.text_splitter

Storing Data in SQLite Database

After processing the text, you'll need a way to store it for later analysis. Databases provide an organized method for storing, retrieving, and managing data.

Why SQLite?

  • Lightweight: SQLite is a serverless, self-contained database engine. It's light enough to be included in mobile apps, desktop software, and large-scale applications.

  • No setup required: Unlike other databases, it doesn’t require a separate server or setup. Data is stored in a single file.

  • Data Storage Procedure:

    • Setting up the database: Before storing data, you need to set up the database structure. This involves creating tables and defining the types of data each column will hold.
    • Inserting data: For each chunk of text, you construct a dictionary containing the chunk's details and use an INSERT INTO SQL statement to add this data to the database.
    • Committing transactions: Databases operate using a transaction model. After inserting the data, you need to commit the transaction, confirming that you want the changes saved.

import os
import fitz
import sqlite3
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)

# The input path should be modified to match your requirements.
PDF_DIR = "/kaggle/input/nlp-and-llm-related-arxiv-papers/"
all_pdf_files = os.listdir(PDF_DIR)

# Let's take a sample of PDFs
all_pdf_files_sample = all_pdf_files[:20]


# Set up the SQLite database
con = sqlite3.connect("chunks.db")
cur = con.cursor()
cur.execute('''
CREATE TABLE IF NOT EXISTS pdf_chunks (
    chunk_id INTEGER PRIMARY KEY,
    chunk_text TEXT,
    page_number INTEGER,
    document_file_name TEXT
)
''')

START_CTR = 100

for each_file in all_pdf_files_sample:
    print("processing.. ", each_file)
    pdf_doc = fitz.open(PDF_DIR + each_file)
    for page_num in range(pdf_doc.page_count):
        current_page = pdf_doc.load_page(page_num).get_text()
        chunks = text_splitter.split_text(str(current_page))        
        for each_chunk in chunks:
            temp_dict = {
                'chunk_id': START_CTR,
                'chunk_text': each_chunk,
                'page_number': page_num,
                'document_file_name': each_file
            }
            
            cur.execute('''
            INSERT INTO pdf_chunks (chunk_id, chunk_text, page_number, document_file_name)
            VALUES (:chunk_id, :chunk_text, :page_number, :document_file_name);
            ''', temp_dict)
            
            START_CTR += 1

    # Commit once per document rather than after every chunk.
    con.commit()

con.close()
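Once the chunks are stored, they can be read back with ordinary SQL. The sketch below is self-contained and makes several assumptions: it uses an in-memory database instead of chunks.db, the three sample rows are made up for illustration, and chunk_id is left out of the INSERT so that SQLite assigns it automatically (an alternative to the START_CTR counter used above).

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE pdf_chunks (
    chunk_id INTEGER PRIMARY KEY,
    chunk_text TEXT,
    page_number INTEGER,
    document_file_name TEXT
)
""")

# Made-up sample rows, for illustration only.
rows = [
    {"chunk_text": "first chunk", "page_number": 0, "document_file_name": "a.pdf"},
    {"chunk_text": "second chunk", "page_number": 0, "document_file_name": "a.pdf"},
    {"chunk_text": "third chunk", "page_number": 1, "document_file_name": "b.pdf"},
]

# `with con:` commits on success and rolls back if an exception is raised.
with con:
    con.executemany(
        """
        INSERT INTO pdf_chunks (chunk_text, page_number, document_file_name)
        VALUES (:chunk_text, :page_number, :document_file_name)
        """,
        rows,
    )

# Count the stored chunks per document.
chunk_counts = con.execute(
    """
    SELECT document_file_name, COUNT(*)
    FROM pdf_chunks
    GROUP BY document_file_name
    ORDER BY document_file_name
    """
).fetchall()
print(chunk_counts)  # [('a.pdf', 2), ('b.pdf', 1)]

con.close()
```

Inserting a batch with executemany inside a single transaction is usually faster than committing after every row, at the cost of losing the whole batch if one insert fails.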

SQLite: The Unsung Hero of Modern Technology

Imagine a world where nearly every device you interact with, from your smartphone to your TV, secretly shares a common thread. Enter SQLite, the world's most widespread database engine. With its unparalleled reach, SQLite is the unsung hero embedded in the heart of countless technological tools and toys.

  • Every smartphone in your pocket, be it Android or iPhone.
  • The Mac you're typing on or the Windows 10 machine you game with.
  • Web browsers like Firefox, Chrome, and Safari that help you explore the vast universe of the internet.
  • Applications like Skype keeping you connected, iTunes serenading your evenings, and Dropbox guarding your precious files.
  • TurboTax and QuickBooks ensuring your finances are in tip-top shape.
  • Even your TV set and the multimedia system in your car.

But that's not all! Developers adore SQLite too. With seamless integrations in popular programming languages like PHP and Python, it's a darling among coders.

Given that over 4 billion smartphones are currently buzzing around, each packed with hundreds of SQLite databases, it's quite plausible we're living in a world with over a trillion SQLite instances working silently in the background. Talk about an unsung technological marvel!

In summary, this chapter has walked through the journey of textual data: extracting it from a complex format (PDF), processing it for NLP tasks (chunking), and storing the processed chunks systematically in SQLite for future analysis.