How to convert multiple PDF files to text files using Python

Question

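I have a Python script that converts a PDF file to a text file. The script asks the user to enter the path of the folder that contains the PDF files.

The problem is that the script converts only one file, so I need to make it convert every PDF file in the specified directory.

The script lists all existing files in the specified directory, but it converts all of them except the last one.

Result after incrementing i:

Code: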

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)

m=len(list)
i=0
while i<=len(list):

    path=list[i]
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail



    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    f.write(content.encode('UTF-8'))
    f.close
    i+=1

Answer 1

You forgot to increment the variable i.

There is a simpler way to do this in Python.

Download and install PDFMiner.

Then use the subprocess module to do the work.

import os
import subprocess

files = [
    'file1.pdf', 'file2.pdf', 'file3.pdf'
]
for f in files:
    # Build the output name by swapping the .pdf extension for .txt
    out_name = os.path.splitext(f)[0] + '.txt'
    cmd = ['python', 'pdf2txt.py', '-o', out_name, f]
    run = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = run.communicate()
    # display errors if they occur (checked per file, inside the loop)
    if err:
        print(err)
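
To apply this to a whole folder rather than a hard-coded list, the file list and the command for each file can be built first. The following is only a sketch under assumptions: the helper names find_pdfs and build_command are made up for illustration, pdf2txt.py is assumed to be reachable from the current directory as in the snippet above, and the code is written for Python 3.

```python
import os
import sys


def find_pdfs(folder):
    """Return the sorted paths of all .pdf files under folder (recursive)."""
    found = []
    for root, dirs, files in os.walk(folder):
        for filename in files:
            if filename.lower().endswith(".pdf"):
                # Join with root (not folder) so files in subfolders resolve.
                found.append(os.path.join(root, filename))
    return sorted(found)


def build_command(pdf_path, script="pdf2txt.py", python=sys.executable):
    """Build the argument list that writes the text next to the PDF.

    script and python are assumptions about the local setup; adjust them
    to wherever PDFMiner's pdf2txt.py and the interpreter actually live.
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    return [python, script, "-o", txt_path, pdf_path]
```

Each list returned by build_command can then be passed to subprocess.Popen in place of cmd. Passing a list avoids shell quoting problems with file names that contain spaces.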

Answer 2

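Besides not incrementing the while loop's variable i, you also use the same variable name i in the for loop. As a result, the value of i has already changed by the time the for loop finishes. You should use separate variable names in the while loop and the for loop.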
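
A small self-contained illustration of that effect, with no PDF library involved. The function name visited_indices and the max_steps safety cap are made up for this demo; the cap exists only so the buggy variant cannot loop forever.

```python
def visited_indices(n_files, pages_per_file, share_counter, max_steps=10):
    """Return the file indices the outer while loop actually visits."""
    visited = []
    i = 0
    while i < n_files and len(visited) < max_steps:
        visited.append(i)
        if share_counter:
            # Bug: the inner loop reuses i and overwrites the outer counter.
            for i in range(pages_per_file):
                pass
        else:
            # Fix: the inner loop has its own variable.
            for j in range(pages_per_file):
                pass
        i += 1
    return visited


print(visited_indices(3, 5, share_counter=True))   # files get skipped
print(visited_indices(3, 5, share_counter=False))  # every file is visited
```

Depending on the page counts, the shared counter either skips files or keeps revisiting the same index, which is why the separate name matters.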

Answer 3

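You created a while loop, but it will run forever because you never update the value of i after each pass through the loop.

Just put i+=1 at the bottom of the while loop, then change the for loop to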

for x in range(0, pdf.getNumPages()):
    # Extract text from page and add to content
    content += pdf.getPage(x).extractText() + "\n"

The for loop's i was interfering with the while loop.
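
An alternative that avoids the manual counter altogether is to iterate over the paths directly. This is a rough sketch, not the asker's script: convert_all and txt_path_for are hypothetical names, extract_text stands in for whatever function returns a PDF's text (for example the pyPdf page loop above), and the code is written for Python 3.

```python
import io
import os


def txt_path_for(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".txt"


def convert_all(pdf_paths, extract_text):
    """Write one UTF-8 text file per PDF and return the written paths.

    extract_text is a caller-supplied callable (an assumption of this
    sketch) that takes a PDF path and returns its text as a string.
    """
    written = []
    for pdf_path in pdf_paths:  # no index variable to forget or overwrite
        content = extract_text(pdf_path)
        out_path = txt_path_for(pdf_path)
        # The with block closes the file even if writing fails.
        with io.open(out_path, "w", encoding="utf-8") as f:
            f.write(content)
        written.append(out_path)
    return written
```

Using os.path.splitext instead of replace(".pdf", ".txt") only touches the extension, and the with block replaces the explicit close call.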

3 answers

Answer 1
1 2017-12-14 06:41:55

Answer 2
1 2017-12-14 06:59:05

Answer 3
0 2017-12-14 06:41:36
