
Convert PDF files to .txt in Python 3

from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

#converts pdf, returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    filepath = open(fname, 'rb')
    for page in PDFPage.get_pages(filepath, pagenums):
        interpreter.process_page(page)
    filepath.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text 

def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in 
    for pdf in os.listdir(pdfDir): #iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf 
            text = convert(pdfFilename) #get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w") #make text file
            textFile.write(text) #write text to text file
            textFile.close() #close the file so the text is flushed to disk

pdfDir = (r"FK_EPPS")
txtDir = (r"FK_txt")
convertMultiple(pdfDir, txtDir)

I am trying to convert multiple PDF files in a folder called FK_EPPS into .txt files and write them to a different folder called FK_txt. But it says there is no such file or directory. I put the folders at exactly those paths. I have looked for a solution, but the error is still there. Can you help me understand why this happens?

/usr/local/lib/python2.7/dist-packages/pdfminer/__init__.py:20: UserWarning: On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For more information see https://github.com/pdfminer/pdfminer.six/issues/194
  warnings.warn('On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For '
Traceback (most recent call last):
  File "/home/a1-re/Documents/pdftotext/1.py", line 44, in <module>
    convertMultiple(pdfDir, txtDir)
  File "/home/a1-re/Documents/pdftotext/1.py", line 36, in convertMultiple
    text = convert(pdfFilename) #get string of text content of pdf
  File "/home/a1-re/Documents/pdftotext/1.py", line 21, in convert
    filepath = file(fname, 'rb')
IOError: [Errno 2] No such file or directory: 'pdf1831150030.pdf'

(There is no way the traceback that you show is correct. With your sample input, the file name in the error message should have started with FK_EPPS.)

You forgot that a path and a filename must be separated from each other with the appropriate separator for your OS.
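A minimal sketch of one way to do this, assuming the folder names from the question: `os.path.join` inserts the correct separator for the current OS. The helper name `build_paths` is hypothetical and only serves as an illustration. Swapping the extension with `os.path.splitext` is an extra assumption about the desired output name.

```python
import os

def build_paths(pdf_dir, txt_dir, pdf_name):
    """Return (input PDF path, output text path) for one file name."""
    # os.path.join puts the OS-specific separator between folder and file.
    pdf_path = os.path.join(pdf_dir, pdf_name)
    # Replace the .pdf extension with .txt instead of appending to it.
    stem = os.path.splitext(pdf_name)[0]
    txt_path = os.path.join(txt_dir, stem + ".txt")
    return pdf_path, txt_path

pdf_path, txt_path = build_paths("FK_EPPS", "FK_txt", "pdf1831150030.pdf")
print(pdf_path)
print(txt_path)
```

Inside `convertMultiple`, the same two calls would replace `pdfDir + pdf` and `txtDir + pdf + ".txt"`.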

You would have seen this immediately if you had printed the value of fname at the start of the convert function. You make the same mistake for the text output filename, but that is harder to notice because it does not raise an error; it only creates a file with the wrong name.
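To illustrate, here is a small sketch that reproduces the string concatenation from the question's code, using the folder names and the file name from the question. The variable names `bad_input` and `bad_output` are made up for this example.

```python
pdf_dir = "FK_EPPS"
txt_dir = "FK_txt"
pdf = "pdf1831150030.pdf"

# What the question's code builds: no separator between folder and file.
bad_input = pdf_dir + pdf
bad_output = txt_dir + pdf + ".txt"

print(bad_input)   # FK_EPPSpdf1831150030.pdf
print(bad_output)  # FK_txtpdf1831150030.pdf.txt
```

The first name does not exist, so opening it fails. The second is a valid file name, so a stray file would be created in the current directory instead of inside FK_txt.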

