How to read all PDF files in a directory and convert them to text files using Tesseract in Python 3?

The code below reads one PDF file and converts it to a text file.

But I want to read all the PDF files in a directory and convert each one to a text file using Tesseract with Python 3.

    from PIL import Image
    import pytesseract
    from pdf2image import convert_from_path

    pdf_filename = "pdffile_name.pdf"
    txt_filename = "text_file_created.txt"

    def tesseract(pdf_filename, txt_filename):
        # Render every page of the PDF as an image at 500 DPI.
        pages = convert_from_path(pdf_filename, 500)

        # Save each page as page_1.jpg, page_2.jpg, ...
        image_counter = 1
        for page in pages:
            image_filename = "page_" + str(image_counter) + ".jpg"
            page.save(image_filename, 'JPEG')
            image_counter = image_counter + 1

        filelimit = image_counter - 1

        # OCR each saved page image and append its text to the output file.
        with open(txt_filename, "a", encoding="utf-8") as f:
            for i in range(1, filelimit + 1):
                image_filename = "page_" + str(i) + ".jpg"
                text = str(pytesseract.image_to_string(Image.open(image_filename)))
                # Rejoin words that were hyphenated across line breaks.
                text = text.replace('-\n', '')
                f.write(text)

        # Read the result back as a list of lines.
        with open(txt_filename, "r", encoding="utf-8") as f1:
            text_list = f1.readlines()
        return text_list

    tesseract(pdf_filename, txt_filename)

I have code for reading the PDF files in a directory, but I don't know how to combine it with the code above.

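    import glob
    import os

    # Directory containing the PDFs. The original snippet left `path` undefined;
    # "." is only a placeholder.
    path = "."

    def readfiles():
        os.chdir(path)
        pdfs = []
        for file_list in glob.glob("*.pdf"):
            print(file_list)
            pdfs.append(file_list)
        return pdfs

    readfiles()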

Simply turn the variable pdf_filename into a list using this code snippet:

    import glob

    pdf_filename = [f for f in glob.glob("your_preferred_path/*.pdf")]

This collects all the PDF files you want and stores their paths in a list.

Or simply use any of the methods posted here:

How do I list all files of a directory?
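For reference, two of the usual alternatives to glob.glob can be sketched as follows. This is only an illustration: the function names list_pdfs_listdir and list_pdfs_recursive are made up for this example, and which variant fits depends on whether you need case-insensitive matching or a search through subdirectories.

```python
import os
from pathlib import Path


def list_pdfs_listdir(directory):
    """List PDF paths in one directory, matching the extension case-insensitively."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    )


def list_pdfs_recursive(directory):
    """List PDF paths in the directory and all of its subdirectories."""
    return sorted(str(p) for p in Path(directory).rglob("*.pdf"))
```

Either function returns a plain list of path strings, so the result can be used wherever the pdf_filename list is used.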

Once you do that, you have a list of PDF files.

Now iterate over the list of PDFs, converting one at a time, which will give you a list of text files.

You can do this with something like the following code snippet:

    for one_pdf in pdf_filename:
        # your code to convert one_pdf goes here
        pass
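Putting the pieces together, a minimal sketch of the whole loop might look like this. It assumes pdf2image and pytesseract are installed along with Poppler and the Tesseract binary. The names ocr_pdf and convert_directory and the pdf_to_text parameter are hypothetical, introduced only for this example. Unlike the code in the question, it passes each page image straight to pytesseract instead of saving page_N.jpg files first, and it writes one text file per PDF, named after the PDF.

```python
from pathlib import Path


def ocr_pdf(pdf_path, dpi=500):
    """OCR a single PDF and return its text as one string."""
    # Imported here so the directory loop below can be used (or tested)
    # even when the OCR stack is not installed.
    from pdf2image import convert_from_path
    import pytesseract

    # Render each page as a PIL image at the requested DPI.
    pages = convert_from_path(str(pdf_path), dpi)
    # image_to_string accepts PIL images, so no temporary JPEG files are needed.
    text = "".join(pytesseract.image_to_string(page) for page in pages)
    # Same hyphenation fix as in the question's code.
    return text.replace("-\n", "")


def convert_directory(pdf_dir, out_dir, pdf_to_text=ocr_pdf):
    """Write one .txt file per PDF in pdf_dir; return the text files created."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_files = []
    for one_pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = out_dir / (one_pdf.stem + ".txt")
        # Mode "w" (via write_text) so re-running does not append duplicate text.
        txt_path.write_text(pdf_to_text(one_pdf), encoding="utf-8")
        txt_files.append(txt_path)
    return txt_files
```

A call such as convert_directory("your_preferred_path", "your_preferred_path/text") would then create a.txt for a.pdf, b.txt for b.pdf, and so on. Rendering at 500 DPI is slow and memory-hungry for long PDFs, so a lower dpi value may be worth trying if OCR quality stays acceptable.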

Hope this helps.
