Best practice for reading a PDF into Python

Question

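I am trying to read a PDF document into Python (I removed some of the content because it contained sensitive data: https://ufile.io/bgghw). I have to work with the checkboxes and perform actions based on them and on other text in the document.

I tried PyPDF3, but it only produced broken output. After some research I found pdfminer, which sounds promising but has the drawback of requiring Python 2.7.

I am not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python, because all the information I found is several years old and much of it is contradictory. Of course, I would like to pick the best package for my case :)

Thanks for any suggestions!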

Answer 1

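First option: PyPDF2

First, run this in cmd to install PyPDF2 (it may work better than PyPDF3, which you have already tried):

pip install PyPDF2

Then use the following code to extract text from the PDF file: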

# Import the PDF library. The names below are the PyPDF2 >= 2.0 API;
# the older PdfFileReader / getPage / extractText names were removed in 3.0.
import PyPDF2

# Open the PDF in binary mode; the with-block closes the file automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    reader = PyPDF2.PdfReader(pdf_file)

    # Print the number of pages in the PDF file
    print(len(reader.pages))

    # Get the first page
    page = reader.pages[0]

    # Extract and print the text of that page
    print(page.extract_text())
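
The question also asks about checkboxes. Those usually live in the PDF's form fields rather than in the page text, so the text extraction above may not show them. Here is a minimal sketch, assuming the document uses standard AcroForm fields and PyPDF2 2.0 or later is installed. `load_form_fields` and `checkbox_states` are hypothetical helper names, not part of PyPDF2. The name of the checked state (often `/Yes`) can differ between documents, so this sketch treats anything other than `/Off` as checked.

```python
from typing import Any, Dict, Mapping


def load_form_fields(path: str) -> Dict[str, Any]:
    """Return the form fields of a PDF, or an empty dict if it has none."""
    # Imported here so checkbox_states() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0 naming

    reader = PdfReader(path)
    # get_fields() returns None when the document has no form fields.
    return dict(reader.get_fields() or {})


def checkbox_states(fields: Mapping[str, Mapping[str, Any]]) -> Dict[str, bool]:
    """Map each button-type field name to True (checked) or False (unchecked)."""
    states: Dict[str, bool] = {}
    for name, field in fields.items():
        # '/FT' is the field type; '/Btn' covers checkboxes, radio buttons
        # and push buttons. Other types (e.g. '/Tx' text fields) are skipped.
        if field.get("/FT") != "/Btn":
            continue
        # '/V' holds the current value; '/Off' or a missing value counts as unchecked.
        value = field.get("/V")
        states[name] = value is not None and str(value) != "/Off"
    return states


if __name__ == "__main__":
    # Hand-written sample in the same shape as the field dictionaries;
    # for a real file, use: checkbox_states(load_form_fields('example.pdf'))
    sample = {
        "agree": {"/FT": "/Btn", "/V": "/Yes"},
        "newsletter": {"/FT": "/Btn", "/V": "/Off"},
        "name": {"/FT": "/Tx", "/V": "Alice"},
    }
    print(checkbox_states(sample))
```

If the form was flattened or the checkboxes are drawn as plain graphics, the fields dictionary will be empty and this approach will not help.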

Second option: Textract

Run this in cmd to install textract:

pip install textract

Then read the PDF with the following code:

import textract

# Extract the text with the pdfminer backend; replace the path with your own PDF
text = textract.process('path/to/pdf/file', method='pdfminer')
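
As far as the textract documentation describes, `process` returns a byte string, so it usually needs decoding before you can act on the text. The sketch below assumes UTF-8 output. `lines_containing` is a hypothetical helper name, and the sample bytes are a stand-in for real extractor output.

```python
from typing import List


def lines_containing(raw: bytes, keyword: str, encoding: str = "utf-8") -> List[str]:
    """Decode extracted bytes and return the stripped lines that mention keyword."""
    # errors="replace" keeps going if some bytes are invalid in this encoding.
    text = raw.decode(encoding, errors="replace")
    needle = keyword.lower()
    # Case-insensitive match, one result per matching line.
    return [line.strip() for line in text.splitlines() if needle in line.lower()]


if __name__ == "__main__":
    # Stand-in for the bytes returned by textract.process(...)
    sample = b"Invoice 2018\nTotal: 42 EUR\nThank you\n"
    print(lines_containing(sample, "total"))
```

With real output you would pass the `text` variable from the snippet above as `raw`.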

Good luck!

Answer 1
4 2018-12-26 21:10:18
