Is there a way to OCR all PDF files within one folder using Python?
As the title states, is there a way to OCR all PDF files within one folder using Python? I have the code below, but it only OCRs one file at a time and extracts the text.
I want to run OCR on all the PDFs in a folder.
Please let me know if it's possible to do so.
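from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io

tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[1]
req_image = []
final_text = []

# Rasterize the PDF at 300 DPI and convert it to JPEG.
image_pdf = Image(filename="./PDF_FILE_NAME", resolution=300)
image_jpeg = image_pdf.convert('jpeg')

# Collect one JPEG blob per page; without this step req_image stays empty.
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

# OCR each page and collect the recognized text.
for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)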
I like the glob module.
You can use it to match file names against a pattern in a given folder.
Here is your code with some edits to show how it might work.
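import glob

# Image, PI, io, pyocr, tool, lang, and final_text are set up as in the question.
pdfs = glob.glob("./*.pdf")  # every PDF in the current folder
for pdf in pdfs:
    # Pass the path as filename=; the first positional argument is not the path.
    image_pdf = Image(filename=pdf, resolution=300)
    image_jpeg = image_pdf.convert('jpeg')
    # A PDF can have several pages, so OCR each page separately.
    for img in image_jpeg.sequence:
        page_blob = Image(image=img).make_blob('jpeg')
        txt = tool.image_to_string(
            PI.open(io.BytesIO(page_blob)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)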
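One thing to watch for: appending everything to a single list loses track of which text came from which file. Below is a minimal sketch of one way to keep results per file. The names ocr_folder and ocr_one are hypothetical (they are not part of wand or pyocr); ocr_one stands for any function that takes a PDF path and returns its text, for example a wrapper around the per-page loop in the code above.

```python
from pathlib import Path
from typing import Callable, Dict


def ocr_folder(folder: str, ocr_one: Callable[[Path], str]) -> Dict[str, str]:
    """Apply ocr_one to each PDF directly inside folder.

    ocr_one is a hypothetical callable (not part of wand or pyocr) that
    takes a PDF path and returns the recognized text for that file.
    """
    results: Dict[str, str] = {}
    # glob makes no ordering guarantee, so sort for a deterministic order.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = ocr_one(pdf)
    return results
```

You would call it as ocr_folder("./", my_ocr_function) and then write each value out to its own .txt file if needed. Note that the "*.pdf" pattern only looks at the top level of the folder; Path.rglob("*.pdf") would also descend into subfolders.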