Generate .txt files from PDF files, keeping the same name as the PDF, using Python

I have a directory containing PDF files. I have written code that performs OCR on a single PDF when its filename is passed to an object of the wand.image.Image class. What I want to do now is loop over the directory of PDF files, generate an OCR'd .txt file for each PDF, and save it to some directory. The code I have written so far is as follows:

import io
from PIL import Image
import pytesseract
from wand.image import Image as wi




pdf = wi(filename = r"D:\files\aba7d525-04b8-4474-a40d-e94f9656ed42.pdf", resolution = 300)

pdfImg = pdf.convert('jpeg')

imgBlobs = []

for img in pdfImg.sequence:
    page = wi(image = img)
    imgBlobs.append(page.make_blob('jpeg'))

extracted_text = []

for imgBlob in imgBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang = 'eng')
    extracted_text.append(text)

print(extracted_text[0])

As you can see in my code (the line starting with "pdf = ..."), I have hardcoded a filename, but I need to pass a directory there instead so that every file in that directory is OCR'd. I also need each output file to keep the name of its PDF, with just .pdf replaced by .txt. How can I do that?

You can use the glob module to list the files in the directory and loop over them.

Example:

import os
import glob
from wand.image import Image as wi

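# Raw strings keep the backslashes in Windows paths literal
# (in a normal string, "\f" would be read as a form feed)
files = glob.glob(r"D:\files\*.pdf")

for file in files:
    pdf = wi(filename = file, resolution = 300)
    # write your code (it should build the extracted_text list for this PDF)
    # Same base name as the PDF, with .pdf replaced by .txt
    name = os.path.splitext(os.path.basename(file))[0] + ".txt"
    with open(os.path.join(r"D:\extracted_files", name), 'w', encoding = 'utf-8') as f:
        # extracted_text is a list with one string per page
        f.write("\n".join(extracted_text))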
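
For completeness, here is a minimal sketch of how the OCR code from the question and the directory loop might fit together. The function names (txt_path_for, ocr_pdf, convert_directory) and the ocr parameter are hypothetical and only there for illustration; the wand and pytesseract calls are the same ones used in the question. It assumes every page's text should go into one .txt file per PDF, separated by newlines.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Return out_dir/<pdf base name>.txt (hypothetical helper)."""
    # .stem drops only the final suffix, so "a.b.pdf" becomes "a.b"
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def ocr_pdf(pdf_path, resolution=300, lang="eng"):
    """OCR one PDF and return a list with one string per page."""
    # Imported here so the path helpers above can be used (and tested)
    # on a machine without the OCR stack installed.
    import io
    import pytesseract
    from PIL import Image
    from wand.image import Image as wi

    pages = []
    with wi(filename=str(pdf_path), resolution=resolution) as pdf:
        with pdf.convert("jpeg") as pdf_img:
            for img in pdf_img.sequence:
                with wi(image=img) as page:
                    blob = page.make_blob("jpeg")
                im = Image.open(io.BytesIO(blob))
                pages.append(pytesseract.image_to_string(im, lang=lang))
    return pages


def convert_directory(in_dir, out_dir, ocr=ocr_pdf):
    """Write one .txt per PDF found in in_dir; return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = txt_path_for(pdf_path, out_dir)
        # Join the per-page strings so the file gets a single string
        target.write_text("\n".join(ocr(pdf_path)), encoding="utf-8")
        written.append(target)
    return written
```

With the directories from the question, this would be called as convert_directory(r"D:\files", r"D:\extracted_files"). Passing the OCR step in as a parameter is just one design choice; it makes it easy to swap in a different extractor or to try the loop without running Tesseract.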
