How do I convert a scanned PDF into a searchable PDF in Python (Mac)? e.g. with the ocrmypdf module

I am writing a Python program that reads a PDF document, extracts text from it, and renames the document using the extracted text. To begin with, the scanned PDF is not searchable. I would like to convert it into a searchable PDF in Python instead of using Google Docs or Cisdem PDF Converter.

I have read about the ocrmypdf module, which can be used to solve this. However, I do not know how to write the code because of my limited knowledge.

The expected output is the scanned PDF converted into a searchable PDF.

I suggest working through the tutorial. It may take you some time, but it should be worth it.

I'm not sure exactly what you want. In my project, the settings below work fine in most cases.

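import ocrmypdf

def ocr(file_path, save_path):
    # Tesseract is an external program that ocrmypdf calls, so it is not imported here.
    # language takes Tesseract language codes, e.g. "eng" for English.
    # remove_background may not be available in newer ocrmypdf releases; drop it if it raises an error.
    ocrmypdf.ocr(file_path, save_path, rotate_pages=True, remove_background=True,
                 language="eng", deskew=True, force_ocr=True)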

This can be done in two steps:

  1. Create a Python OCR function:

     import ocrmypdf

     def ocr(file_path, save_path):
         ocrmypdf.ocr(file_path, save_path)

  2. Call the function:

     ocr("input.pdf", "output.pdf")

Thank you. If you have any questions, please ask.
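The two steps above extend naturally to a whole folder of scans. The following is only a rough sketch: the helper names `output_path_for` and `ocr_folder` and the `_ocr` file suffix are made up for illustration, and it assumes ocrmypdf and Tesseract are already installed. As far as I know, the ocrmypdf documentation recommends calling `ocrmypdf.ocr` from under an `if __name__ == "__main__":` guard on macOS and Windows, so the sketch does that.

```python
import sys
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Return where the searchable copy of pdf_path should be written (hypothetical naming scheme)."""
    pdf_path = Path(pdf_path)
    return Path(out_dir) / f"{pdf_path.stem}_ocr.pdf"


def ocr_folder(in_dir, out_dir):
    """Run OCR on every PDF directly inside in_dir and write the results to out_dir."""
    import ocrmypdf  # imported here so the path helper above works without ocrmypdf installed

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = output_path_for(pdf_path, out_dir)
        # skip_text=True leaves pages that already contain text untouched
        ocrmypdf.ocr(pdf_path, target, skip_text=True)
        print(f"{pdf_path.name} -> {target.name}")


if __name__ == "__main__":
    # Only runs when both folders are given on the command line
    if len(sys.argv) == 3:
        ocr_folder(sys.argv[1], sys.argv[2])
```

If this is saved under a name of your choice, say batch_ocr.py, it could be run as `python batch_ocr.py scans searchable`, where both folder names are placeholders.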

I have also faced the same issue with scanned PDF files. I found a solution that handles it in a few lines of code. This code converts a scanned PDF document into a searchable PDF in which the text can be selected.

import ocrmypdf
def scannedPdfConverter(file_path, save_path):
    ocrmypdf.ocr(file_path, save_path, skip_text=True)
    print('File converted successfully!')
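Since the question also asks about renaming the file from its extracted text, here is one possible way to do that, offered as a sketch rather than a tested solution. It relies on the `sidecar` argument of `ocrmypdf.ocr`, which writes the recognized text to a plain text file. The function names `filename_from_text` and `ocr_and_rename`, the temporary file names, and the rule "use the first non-empty line as the title" are all assumptions made for illustration.

```python
import re
from pathlib import Path


def filename_from_text(text, max_len=60):
    """Build a safe file name (no extension) from the first usable line of text."""
    for line in text.splitlines():
        # Replace every run of characters other than letters, digits, "_" and "-" with "_"
        cleaned = re.sub(r"[^\w\-]+", "_", line.strip()).strip("_")
        if cleaned:
            return cleaned[:max_len]
    return "untitled"  # fallback when OCR found no usable text


def ocr_and_rename(file_path, out_dir):
    """OCR file_path, then name the searchable PDF after its first line of text."""
    import ocrmypdf  # imported here so filename_from_text works without ocrmypdf installed

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tmp_pdf = out_dir / "ocr_tmp.pdf"   # placeholder name for the intermediate PDF
    sidecar = out_dir / "ocr_tmp.txt"   # placeholder name for the extracted text

    # force_ocr=True runs OCR on every page, so the sidecar holds text for all of them
    ocrmypdf.ocr(file_path, tmp_pdf, force_ocr=True, sidecar=sidecar)

    text = sidecar.read_text(encoding="utf-8")
    final_pdf = out_dir / f"{filename_from_text(text)}.pdf"
    tmp_pdf.rename(final_pdf)  # note: may overwrite an existing file with the same name
    sidecar.unlink()
    return final_pdf
```

A call such as `ocr_and_rename("input.pdf", "renamed")` would then leave a single searchable PDF in the `renamed` folder. Scanned documents often start with a letterhead or a page number, so the title rule will probably need adjusting for real files.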

