
Convert pdf files to raw text in new directory

Here is what I'm trying:

import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

import re
import config
import sys
import os

with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        if reader.isEncrypted:
            reader.decrypt('Password123')
            print(f"Number of page: {reader.getNumPages()}")

            for i in range(reader.numPages):
                output = PdfFileWriter()
                output.addPage(reader.getPage(i))                
                with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
                    output.write(outputStream)
                    print(outputStream)

                    for page in output.pages: # failing here
                        print page.extractText() # failing here

The entire program decrypts a large pdf file from one location and splits it into a separate pdf file per page in a new directory; this part is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).

Ideally, I can use my current imports, i.e. PyPDF2, without importing/installing more modules. Any thoughts?

You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its own docstring notes:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.

    :return: a unicode string object.
    """

One thing to do is to check whether there is any text in your pdf at all. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Try highlighting the text in the pdf and copy/pasting it into a text file to see what text, if any, can be extracted.
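The same check can be roughly approximated in code. The sketch below assumes you already have one extracted string per page (for example from extractText()); the helper name pages_without_text is made up for illustration. Pages that come back empty or whitespace-only are likely image-only scans that would need OCR.

```python
def pages_without_text(page_texts):
    """Return the indices of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Example with made-up page strings: pages 1 and 2 have no usable text.
sample = ["Invoice 42", "", "  \n"]
print(pages_without_text(sample))  # → [1, 2]
```

An empty result does not guarantee the extracted text is good, only that something was extracted from every page.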

If you can't get your solution working, you'll need to use another package such as Tika.
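Before reaching for another package, two things in the posted snippet are worth checking. `print page.extractText()` is Python 2 syntax and is a SyntaxError under Python 3, which the f-string earlier in the script requires. The text can also be read straight from the reader's pages, without going through the writer (whether the writer exposes `pages` depends on the installed PyPDF2 version, so treat that part as an assumption). A minimal sketch staying with PyPDF2 follows. It assumes the older (pre-3.0) API used in the question; the function names and the document-page%s.txt naming are illustrative choices, not anything PyPDF2 prescribes.

```python
import os


def extract_page_texts(pdf_path, password=None):
    """Return a list holding the extracted text of each page."""
    # Imported lazily so write_page_texts works even without PyPDF2 installed.
    # Legacy (pre-3.0) names, as in the question; newer releases rename them
    # (PdfReader, is_encrypted, pages, extract_text).
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as f:
        reader = PdfFileReader(f)
        if reader.isEncrypted and password is not None:
            reader.decrypt(password)
        # Extract while the file is still open.
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def write_page_texts(page_texts, out_dir):
    """Write one UTF-8 .txt file per page and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(page_texts):
        path = os.path.join(out_dir, "document-page%s.txt" % i)
        with open(path, "w", encoding="utf-8") as out:
            out.write(text)
        paths.append(path)
    return paths
```

With the question's own values this would be called as write_page_texts(extract_page_texts(config.ENCRYPTED_FILE_PATH, "Password123"), "./txt_versions"). If the resulting files are empty, the PDF probably has no text layer and the advice above about OCR or Tika applies.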

