Is there a way in Python to identify whether a PDF has been OCR'd (the text quality is bad) versus a natively searchable PDF (the text quality is perfect)?

Question description: check whether a searchable PDF has been OCR'd, or whether it is a true searchable PDF.

Using the PDF's metadata:

import pprint
import PyPDF2

def get_doc_info(path):
    pp = pprint.PrettyPrinter(indent=4)
    with open(path, 'rb') as f:  # PdfFileReader expects a binary stream
        pdf_file = PyPDF2.PdfFileReader(f)
        doc_info = pdf_file.getDocumentInfo()
        pp.pprint(doc_info)

I find:

result = get_doc_info("PDF_SEARCHABLE_HAS_BEEN_OCRD.pdf")
{   '/Author': 'NAPS2',
    '/CreationDate': "D:20200701104101+02'00'",
    '/Creator': 'NAPS2',
    '/Keywords': '',
    '/ModDate': "D:20200701104101+02'00'",
    '/Producer': 'PDFsharp 1.50.4589 (www.pdfsharp.com)'}



result = get_doc_info("PDF_SEARCHABLE_TRUE.pdf")
{   '/CreationDate': 'D:20210802122000Z',
    '/Creator': 'Quadient CXM AG~Inspire~14.3.49.7',
    '/Producer': ''}

Can I check the type of the PDF (true PDF or OCR'd PDF) using the Creator field from the PDF's metadata?
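
A heuristic along these lines is possible, though metadata is unreliable: it can be missing, empty (as in the second example above), or rewritten by any tool that touches the file. A minimal sketch, where the set of OCR producer names is my own guess and not exhaustive:

import PyPDF2

# Guessed hints of scanning/OCR software -- not an exhaustive list:
OCR_SOFTWARE_HINTS = {"naps2", "tesseract", "abbyy", "paper capture"}

def looks_ocrd_by_metadata(path: str) -> bool:
    """Heuristic only: True if the metadata hints at OCR software."""
    with open(path, "rb") as f:
        info = PyPDF2.PdfFileReader(f).getDocumentInfo() or {}
        fields = str(info.get("/Creator", "")) + " " + str(info.get("/Producer", ""))
    return any(hint in fields.lower() for hint in OCR_SOFTWARE_HINTS)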

Is there another way to do this in Python?

If there is no direct solution to the problem, how can I use deep learning/machine learning to detect whether a searchable PDF is a true PDF or an OCR'd one?
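
One sketch of the ML route, assuming you can first build a labeled corpus of true and OCR'd PDFs: compute a few numeric features per document and fit a standard classifier. The feature choices and the numbers below are placeholders for illustration, not real data:

from sklearn.linear_model import LogisticRegression

# Placeholder feature vectors -- in practice, compute these from your corpus.
# Hypothetical features per PDF:
#   [% of scanned pages, fraction of page area covered by images, chars per page]
X = [
    [100.0, 0.98, 1500],  # labeled OCR'd scan
    [ 95.0, 1.00, 1200],  # labeled OCR'd scan
    [  0.0, 0.05, 2300],  # labeled true (native) PDF
    [  0.0, 0.10, 1900],  # labeled true (native) PDF
]
y = [1, 1, 0, 0]  # 1 = OCR'd, 0 = true PDF

clf = LogisticRegression().fit(X, y)
print(clf.predict([[100.0, 0.95, 1400]]))  # -> [1], i.e. classified as OCR'd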

Here is a video explaining the difference between a TRUE PDF and an OCR PDF: https://www.youtube.com/watch?v=xs8KQbxsMcw

Expert Answer

An expert developed this function (based on some SO post I cannot recall) to solve this problem:

import fitz  # PyMuPDF

def get_scanned_pages_percentage(filepath: str) -> float:
    """
    INPUT: path to a pdf file
    OUTPUT: % of text-bearing pages whose text comes from OCR
    """
    total_pages = 0
    total_scanned_pages = 0
    with fitz.open(filepath) as doc:
        for page in doc:
            text = page.getText().strip()
            if len(text) == 0:
                # Ignore "empty" pages (no extractable text at all)
                continue
            total_pages += 1
            pix1 = page.getPixmap(alpha=False)  # render page to an image
            remove_all_text(doc, page)          # strip the page's text objects
            pix2 = page.getPixmap(alpha=False)  # render again, now without text
            img1 = pix1.getImageData("png")
            img2 = pix2.getImageData("png")
            if img1 == img2:
                # Rendering is unchanged after deleting the text: the visible
                # "text" is part of an image, so the extractable text must
                # come from an invisible OCR layer.
                total_scanned_pages += 1
    if total_pages == 0:
        return 0
    return (total_scanned_pages / total_pages) * 100

This function returns 100 (or close to it) if the PDF is an image-based document containing OCR'd text, and 0 if it is a natively digital PDF.
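
A possible way to use it (the 50% threshold is an arbitrary choice of mine, not part of the original answer):

percentage = get_scanned_pages_percentage("PDF_SEARCHABLE_HAS_BEEN_OCRD.pdf")
if percentage > 50:
    print(f"Likely OCR'd: {percentage:.1f}% of text pages are scanned images")
else:
    print(f"Likely a true digital PDF: {percentage:.1f}%")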

And here is remove_all_text:

def remove_all_text(doc, page):
    """Removes all text objects from a PDF page's content stream."""
    page.cleanContents()  # syntactic cleaning of the page's appearance commands

    # xref of the cleaned command stream (a bytes object)
    xref = page.getContents()[0]

    cont = doc.xrefStream(xref)  # read the content stream
    # The stream is extracted as bytes; we then search for the BT ... ET
    # operator pairs that delimit text objects and delete them.
    ba_cont = bytearray(cont)  # a modifiable version
    pos = 0
    changed = False  # switch indicating changes
    while pos < len(cont) - 1:
        pos = ba_cont.find(b"BT\n", pos)  # begin text object
        if pos < 0:
            break  # not (more) found
        pos2 = ba_cont.find(b"ET\n", pos)  # end text object
        if pos2 <= pos:
            break  # major error in the PDF page definition!
        ba_cont[pos: pos2 + 2] = b""  # remove the text object
        changed = True
    if changed:  # we have indeed removed some text
        doc.updateStream(xref, ba_cont)  # write back the command stream w/o text
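
To make the BT/ET scan concrete, here is a toy, hand-written content stream (a real one contains many more operators) showing what the loop deletes:

cont = bytearray(
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"              # draw the scanned page image
    b"BT\n/F1 12 Tf 72 720 Td (OCR text) Tj\nET\n"   # the invisible OCR text object
)
pos = cont.find(b"BT\n")
pos2 = cont.find(b"ET\n", pos)
cont[pos: pos2 + 2] = b""  # same slice as in the loop above
print(cont)  # only the image-drawing commands (plus a stray newline) remain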
