Convert PDF files to raw text in a new directory

Question

Here is what I'm trying:

import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

import re
import config
import sys
import os

with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        if reader.isEncrypted:
            reader.decrypt('Password123')
            print(f"Number of page: {reader.getNumPages()}")

            for i in range(reader.numPages):
                output = PdfFileWriter()
                output.addPage(reader.getPage(i))                
                with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
                    output.write(outputStream)
                    print(outputStream)

                    for page in output.pages: # failing here
                        print page.extractText() # failing here

The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory. That part works fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).

Ideally, I would keep my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?

Answer 1

You have not described how the last two lines are failing, but extractText() does not work well on some PDFs, as its own docstring in PyPDF2 says:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.

    :return: a unicode string object.
    """

One thing to check is whether your PDF contains any text at all. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Try highlighting the text in the PDF and copying/pasting it into a text file to see what text, if any, can be extracted.
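The same check can be roughly automated while writing the .txt files. The following is only a sketch, assuming the legacy (pre-3.0) PyPDF2 names used in the question; write_page_texts and page_text_path are hypothetical helper names, and the ./txt_versions default comes from the question's description. It reads text from the reader's pages rather than from the PdfFileWriter, and it calls print-style output nowhere, since print page.extractText() is Python 2 syntax and is a SyntaxError under Python 3.

```python
import os


def page_text_path(out_dir, page_index):
    """Build the .txt path for one page, mirroring the question's PDF naming."""
    return os.path.join(out_dir, "document-page%s.txt" % page_index)


def write_page_texts(pdf_path, out_dir="./txt_versions", password=None):
    """Write one .txt file per page and return the indexes of pages
    for which extractText() produced no visible text."""
    # Imported lazily so page_text_path can be used without PyPDF2 installed.
    # Assumption: legacy PyPDF2 (< 3.0) API, as in the question.
    from PyPDF2 import PdfFileReader

    os.makedirs(out_dir, exist_ok=True)
    empty_pages = []
    with open(pdf_path, "rb") as f:
        reader = PdfFileReader(f)
        if reader.isEncrypted and password is not None:
            reader.decrypt(password)
        for i in range(reader.getNumPages()):
            # Extract from the reader's page while the source file is still open.
            text = reader.getPage(i).extractText()
            if not text.strip():
                empty_pages.append(i)  # likely a scanned image or unsupported encoding
            with open(page_text_path(out_dir, i), "w", encoding="utf-8") as out:
                out.write(text)
    return empty_pages
```

If the returned list covers most pages, the PDF probably has no usable text layer, and no amount of fiddling with extractText() will help.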

If you can't get your solution working, you'll need to use another package, such as Tika.
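A fallback along those lines might look like the sketch below. It assumes the third-party tika-python package (which needs a Java runtime and starts or downloads a Tika server on first use); extract_text_with_tika and save_text are hypothetical helper names, not part of Tika.

```python
import os


def extract_text_with_tika(pdf_path):
    """Return the text Tika extracts from pdf_path, or '' if it finds none."""
    # Imported lazily: tika-python is an extra dependency, unlike PyPDF2.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # The "content" entry can be None when no text is found.
    return parsed.get("content") or ""


def save_text(text, pdf_path, out_dir="./txt_versions"):
    """Write text to out_dir under the PDF's base name with a .txt extension."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    txt_path = os.path.join(out_dir, stem + ".txt")
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return txt_path
```

For one of the split pages this would be called as save_text(extract_text_with_tika("./pdfs/document-page0.pdf"), "./pdfs/document-page0.pdf"). Note that this goes against the question's wish to avoid installing more modules, so it only makes sense once PyPDF2 has been ruled out.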

Answer 1 · 0 · 2019-07-23 16:22:57
