Python: Iterate over a directory and write the results to separate txt files

Question

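I am trying to iterate over a directory of PDF files. I first convert each PDF to JPEG images and then to text. I have been able to loop over the PDF directory and write every JPEG into a single txt file, but what I really need is a separate txt file for each PDF. I understand that each page of a PDF is converted to a JPEG and then written to the text file. If there are 2 PDFs, I want 2 txt files. Below is the code I have so far. Thanks for your help.

from PIL import Image
import pytesseract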
import sys 
from pdf2image import convert_from_path 
import os 
import cv2
import glob

for filepath in glob.iglob("path/*.pdf"):
    PDF_file = filepath
  
    pages = convert_from_path(PDF_file, 500) 
  
    image_counter = 1
  
    for page in pages: 
  
        filename = "page_"+str(image_counter)+".jpg"
      
        page.save(filename, 'JPEG') 
  
        image_counter = image_counter + 1
  
    filelimit = image_counter-1
  
    outfile = "out_text.txt"
  
    f = open(outfile, "a") 
  
    for i in range(1, filelimit + 1): 
  
        filename = "page_"+str(i)+".jpg"
          
        text = str(((pytesseract.image_to_string(Image.open(filename))))) 
  
        text = text.replace('-\n', '')     
  
        f.write(text) 
  
    f.close()

Answer 1

If you want the output for each PDF page in its own text file, open a file with a different name for each page, like this:

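for i in range(1, filelimit + 1):
    # One output file per page: out_text_1.txt, out_text_2.txt, ...
    outfile = "out_text_" + str(i) + ".txt"
    filename = "page_" + str(i) + ".jpg"
    # OCR the page image that was saved earlier in the loop
    text = pytesseract.image_to_string(Image.open(filename))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    # Append mode, as in the question; the with block closes the file
    with open(outfile, "a") as f:
        f.write(text)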
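
The question asks for one text file per PDF rather than one per page. A minimal sketch of that variant follows, assuming the output file is named after the PDF (report.pdf becomes report.txt). The function names text_path_for, pdf_to_text and convert_directory, and the out_dir parameter, are hypothetical and do not come from the original post. The page images are passed straight to pytesseract, so no intermediate JPEG files are written.

```python
from pathlib import Path


def text_path_for(pdf_path, out_dir):
    """Return the output .txt path for one PDF, e.g. report.pdf -> report.txt."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_text(pdf_path, dpi=500):
    """OCR every page of one PDF and return the combined text."""
    # Imported here so the path helper above works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    # convert_from_path returns a list of PIL images, one per page.
    pages = convert_from_path(str(pdf_path), dpi)
    chunks = []
    for page in pages:
        text = pytesseract.image_to_string(page)  # accepts a PIL image directly
        chunks.append(text.replace("-\n", ""))    # rejoin words hyphenated across lines
    return "".join(chunks)


def convert_directory(pdf_dir, out_dir):
    """Write one .txt file per PDF found in pdf_dir (hypothetical helper)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        out_path = text_path_for(pdf_path, out_dir)
        # write_text overwrites, so a rerun replaces the file instead of appending.
        out_path.write_text(pdf_to_text(pdf_path), encoding="utf-8")
```

Calling convert_directory("path", "text_out") would then process every PDF in the question's "path" directory; "text_out" is a placeholder name. Because the output name is derived from each PDF's name, two PDFs produce two txt files.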

Solution 1
1 2021-03-21 19:38:14
