close

How to check if PDF is scanned image or contains text

Hello Guys, How are you all? Hope You all Are Fine. Today We Are Going To learn about How to check if PDF is scanned image or contains text in Python. So Here I am Explain to you all the possible Methods here.

Without wasting your time, Let’s start This Article.

Table of Contents

How to check if PDF is scanned image or contains text?

  1. How to check if PDF is scanned image or contains text?

    The below code will work, to extract data text data from both searchable and non-searchable PDF's.

  2. check if PDF is scanned image or contains text

    The below code will work, to extract data text data from both searchable and non-searchable PDF's.

Method 1

The below code will work, to extract data text data from both searchable and non-searchable PDF’s.

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.getText()

If you don’t have fitz module you need to do this:

pip install --upgrade pymupdf

Method 2

def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print(f"Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable

Summery

It’s all About this issue. Hope all Methods helped you a lot. Comment below Your thoughts and your queries. Also, Comment below which Method worked for you? Thank You.

Also, Read