Need to extract text from a PDF file using Python

Question

I am trying to extract text from a PDF file, but it gives an error:

PdfReadError: Could not read malformed PDF file

Can anyone guide me on how to proceed with this? Here is the code:

import os
import PyPDF2

dir_name = 'path to folder'
files = os.listdir(dir_name)
os.chdir(dir_name)
for j in files:
    print(j)
    print("In file")
    pdfFileObj = open(j, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pdfFile = pdfReader.getPage(0)

    # page_lines = pdfFile.extractText()
    print(pdfFile.extractText())

    pdfFileObj.close()

Answer 1

This is probably caused by the files in the directory you chdir into. Make sure it contains nothing other than PDF files, or select files by extension and process only those ending in .pdf. Here is similar code. Try running it on just the files you found to be malformed.

import PyPDF2
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
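
To make the extension filter concrete, here is a minimal sketch rather than a definitive fix. The helper names list_pdf_files and print_first_pages are made up for illustration. It assumes the same older PyPDF2 API used above (PdfFileReader, numPages, getPage, extractText), which I believe was removed in PyPDF2 3.0, and it catches Exception broadly because where PdfReadError lives varies between versions.

```python
import os


def list_pdf_files(dir_name):
    """Return full paths of the regular files in dir_name that end in .pdf."""
    paths = []
    for name in sorted(os.listdir(dir_name)):
        path = os.path.join(dir_name, name)
        # Case-insensitive match, so "REPORT.PDF" is picked up as well.
        if os.path.isfile(path) and name.lower().endswith(".pdf"):
            paths.append(path)
    return paths


def print_first_pages(dir_name):
    """Print the first page's text of each PDF, skipping files that fail."""
    # Imported here so list_pdf_files works even without PyPDF2 installed.
    # Assumption: an older PyPDF2 release that still has PdfFileReader.
    import PyPDF2

    for path in list_pdf_files(dir_name):
        try:
            with open(path, "rb") as pdf_file:
                reader = PyPDF2.PdfFileReader(pdf_file)
                print(path, reader.numPages)
                print(reader.getPage(0).extractText())
        except Exception as exc:
            # Broad on purpose: report the bad file and carry on with the rest.
            print("Skipping", path, "-", exc)
```

Calling print_first_pages('path to folder') then reports each malformed file by name instead of stopping at the first one, and there is no need to chdir because full paths are used.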

Update

It has been observed that the PyPDF2 module does not work reliably. It is only dependable for its .numPages attribute. Other methods may or may not work as expected, and sometimes return nothing.

Try PDFMiner (pdfminer) for robust extraction. It has a lot of options to explore.
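
As a starting point, a minimal sketch of what that could look like, assuming the pdfminer.six package is installed and using its high-level extract_text function. The wrapper name extract_pdf_text and its extractor parameter are made up for illustration; the parameter only exists so that a different extraction function can be swapped in.

```python
def extract_pdf_text(path, extractor=None):
    """Return the text of the PDF at path, or None if extraction fails."""
    if extractor is None:
        # Assumption: pdfminer.six is installed (pip install pdfminer.six).
        from pdfminer.high_level import extract_text as extractor
    try:
        return extractor(path)
    except Exception as exc:
        # Broad on purpose: a malformed file should not stop a batch run.
        print("Could not extract", path, "-", exc)
        return None
```

Something like text = extract_pdf_text('example.pdf') returns the text of the whole document as one string, or None when the file cannot be parsed. Whether PDFMiner copes with the files that PyPDF2 rejected depends on how badly they are damaged.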

Solution 1
0 2020-10-03 15:23:17