[英]How do I convert scanned PDF into searchable PDF in Python (Mac)? e.g. OCRMYPDF module
I am writing a program in python that can read pdf document, extract text from the document and rename the document using extracted text.我正在 python 中编写一个程序,它可以读取 pdf 文档,从文档中提取文本并使用提取的文本重命名文档。 At first, the scanned pdf document is not searchable.
起初,扫描的 pdf 文档不可搜索。 I would like to convert the pdf into searchable pdf on Python instead of using Google doc, Cisdem pdf converter.
我想将Z437175BA4191210EE 004E1D937494D09Z转换为可搜索的Z437175BA4191210EE004E1D937494D094D094D094D094D094D在za77f554f35426b927411fc92741b561b5633382174 ectrable in za7494 fcroveem
I have read about ocrmypdf module which can used to solve this.我已经阅读了可用于解决此问题的 ocrmypdf 模块。 However, I do not know how to write the code due to my limited knowledge.
但是,由于我的知识有限,我不知道如何编写代码。
I expect the output to convert the scanned pdf into searchable pdf.我希望 output 将扫描的 pdf 转换为可搜索的 pdf。
I suggest working on the working through the turoial, will maybe take you some time but it should be wortht it.我建议通过 turoial 进行工作,可能会花费您一些时间,但这应该是值得的。
I'm not exactly sure what you exactly want.我不确定你到底想要什么。 In my project the settings below work fine in Most of the Cases.
在我的项目中,以下设置在大多数情况下都可以正常工作。
import ocrmypdf , tesseract def ocr(file_path, save_path): ocrmypdf.ocr(file_path, save_path, rotate_pages=True, remove_background=True,language="en", deskew=True, force_ocr=True)
This would be done well into two steps这将分两步完成
Create Python OCR Python function import ocrmypdf def ocr(file_path, save_path): ocrmypdf.ocr(file_path, save_path)创建 Python OCR Python 函数 import ocrmypdf def ocr(file_path, save_path): ocrmypdf.ocr(file_path, save_path)
Call and use a function.调用并使用一个函数。 ocr("input.pdf","output.pdf")
ocr("input.pdf","output.pdf")
Thank you, if you got any question ask please.谢谢,有问题请追问。
I have also faced the same issues with scanned pdf files.我在扫描 pdf 文件时也遇到了同样的问题。 I found a solution to handle this with these 3 lines of code.
我找到了用这 3 行代码来处理这个问题的解决方案。 This code can convert a scanned pdf document into a searchable and select a text in pdf document.
此代码可以将扫描的 pdf 文档转换为可搜索的 select 文档中的 pdf 文档。
import ocrmypdf
def scannedPdfConverter(file_path, save_path):
ocrmypdf.ocr(file_path, save_path, skip_text=True)
print('File converted successfully!')
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.