Is there a Python way to tell whether a searchable PDF has been OCR'd (so the quality of the text layer may be bad) or is a true searchable PDF (where the text is exact)?
Question description: check whether a searchable PDF has been OCR'd, or is a true searchable PDF.
Using the PDF's metadata:
import pprint

import PyPDF2

def get_doc_info(path):
    pp = pprint.PrettyPrinter(indent=4)
    pdf_file = PyPDF2.PdfFileReader(path)
    doc_info = pdf_file.getDocumentInfo()
    pp.pprint(doc_info)
Running it on both kinds of file, I find:
result = get_doc_info('PDF_SEARCHABLE_HAS_BEEN_OCRD.pdf')
{ '/Author': 'NAPS2',
'/CreationDate': "D:20200701104101+02'00'",
'/Creator': 'NAPS2',
'/Keywords': '',
'/ModDate': "D:20200701104101+02'00'",
'/Producer': 'PDFsharp 1.50.4589 (www.pdfsharp.com)'}
result = get_doc_info('PDF_SEARCHABLE_TRUE.pdf')
{ '/CreationDate': 'D:20210802122000Z',
'/Creator': 'Quadient CXM AG~Inspire~14.3.49.7',
'/Producer': ''}
Can I check the type of the PDF (true PDF or OCR PDF) using the Creator field from the PDF's metadata?
Is there another way to do this in Python?
If there is no direct solution to the problem, how could I use deep learning/machine learning to detect whether a searchable PDF is true or OCR'd?
This video explains the difference between a true PDF and an OCR PDF: https://www.youtube.com/watch?v=xs8KQbxsMcw
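One quick but fragile heuristic, based on the metadata shown above: scanning/OCR front ends such as NAPS2 tend to identify themselves in /Creator or /Producer. The tool list below is an assumption (NAPS2 comes from the question's own metadata; the others are common OCR producers) and is far from complete, so treat a match as a hint, not a verdict:

```python
# Heuristic sketch: guess "OCR'd" vs "true" PDF from the metadata dict.
# KNOWN_OCR_TOOLS is an illustrative, incomplete assumption.
KNOWN_OCR_TOOLS = ("naps2", "tesseract", "abbyy", "ocrmypdf", "readiris")

def looks_ocrd(doc_info: dict) -> bool:
    """Return True if /Creator, /Producer, or /Author mentions a known OCR tool."""
    haystack = " ".join(
        str(doc_info.get(key, "")) for key in ("/Creator", "/Producer", "/Author")
    ).lower()
    return any(tool in haystack for tool in KNOWN_OCR_TOOLS)

print(looks_ocrd({"/Creator": "NAPS2", "/Producer": "PDFsharp 1.50.4589"}))  # True
print(looks_ocrd({"/Creator": "Quadient CXM AG~Inspire~14.3.49.7"}))         # False
```

Metadata is easily absent or misleading (many true PDFs have an empty /Producer, and OCR tools can be configured to write nothing), so a rendering-based check is more reliable.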

Expert Answer
An expert developed the following function (based on an SO post they cannot recall) to solve this problem:
import fitz  # PyMuPDF; newer versions rename getText/getPixmap to get_text/get_pixmap

def get_scanned_pages_percentage(filepath: str) -> float:
    """
    INPUT:  path to a PDF file
    OUTPUT: percentage of pages that are scanned images with OCR'd text
    """
    total_pages = 0
    total_scanned_pages = 0
    with fitz.open(filepath) as doc:
        for page in doc:
            text = page.getText().strip()
            if len(text) == 0:
                # Ignore "empty" pages (no text layer at all)
                continue
            total_pages += 1
            pix1 = page.getPixmap(alpha=False)  # render page to an image
            remove_all_text(doc, page)          # strip the text layer
            pix2 = page.getPixmap(alpha=False)  # render again, now without text
            img1 = pix1.getImageData("png")
            img2 = pix2.getImageData("png")
            if img1 == img2:
                # The page renders identically with and without its text layer,
                # so the visible text is part of an image: the page was scanned
                # and its text layer was added by OCR.
                total_scanned_pages += 1
    if total_pages == 0:
        return 0
    return (total_scanned_pages / total_pages) * 100
This function returns 100 (or close to it) if the PDF is an image containing OCR'd text, and 0 if it is a native digital PDF.
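To turn that percentage into a label, you can threshold it. The 90%/10% cutoffs below are illustrative assumptions, not values from the answer; you would tune them on your own documents:

```python
def classify_pdf(scanned_percentage: float,
                 ocr_cutoff: float = 90.0,
                 true_cutoff: float = 10.0) -> str:
    """Map the output of get_scanned_pages_percentage() to a rough label.

    The cutoffs are assumptions; documents mixing scanned and digital
    pages land in between and are labelled "mixed".
    """
    if scanned_percentage >= ocr_cutoff:
        return "ocr"
    if scanned_percentage <= true_cutoff:
        return "true"
    return "mixed"

print(classify_pdf(100.0))  # ocr
print(classify_pdf(0.0))    # true
print(classify_pdf(45.0))   # mixed
```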
The remove_all_text helper:
def remove_all_text(doc, page):
    """Removes all text objects from a PDF page's content stream."""
    page.cleanContents()  # syntactic cleanup of the page's appearance commands
    xref = page.getContents()[0]  # xref of the cleaned command source
    cont = doc.xrefStream(xref)   # read it (bytes object)
    # Search the command stream for text objects (BT ... ET) and delete them.
    ba_cont = bytearray(cont)  # a modifiable version
    pos = 0
    changed = False  # switch indicating changes
    while pos < len(cont) - 1:
        pos = ba_cont.find(b"BT\n", pos)  # begin text object
        if pos < 0:
            break  # not (more) found
        pos2 = ba_cont.find(b"ET\n", pos)  # end text object
        if pos2 <= pos:
            break  # major error in the PDF page definition!
        ba_cont[pos : pos2 + 2] = b""  # remove the text object
        changed = True
    if changed:  # we have indeed removed some text
        doc.updateStream(xref, ba_cont)  # write back the command stream without text