Text recognition on images with Tesseract

Question

I would like to create a PDF file containing the recognized text from a scanned image.

But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read; the font doesn't matter much.

This Tesseract command does almost what I want, but the text is invisible.

tesseract -c textonly_pdf=1 test.tif test pdf

How can I make the text visible?
Can I create the PDF file with another command-line or Python tool?

I'm running Tesseract on Ubuntu.

Answer 1

Here is a snippet from a Python script I wrote a year ago that extracts the text into a dataframe (which you can then save to CSV or other formats).

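import cv2
import pytesseract as pya
from pytesseract import Output

# Only needed when tesseract is not on PATH (this is a Windows install path;
# on Ubuntu the binary is normally found without this line)
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'

imgcv = cv2.imread('foo.jpg')
# text_df holds one row per detected element: text, confidence, bounding box, ...
text_df = pya.image_to_data(imgcv, output_type=Output.DATAFRAME)
# keep only words recognized with a confidence above 50
# (this also drops the structural rows, which have conf == -1)
text_df = text_df[text_df.conf > 50]
# mean confidence of the remaining words
conf = text_df['conf'].mean()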
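
To get from the filtered words to readable plain text, the words have to be joined back into lines. The following is a minimal sketch of that step, not part of the original script. It assumes the OCR result is requested as a dict (output_type=Output.DICT) rather than a dataframe, so it runs without pandas; the function name words_to_lines and the sample data are made up for illustration.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=50):
    """Join OCR words into text lines.

    `data` is assumed to be a dict of equal-length lists with the keys
    'text', 'conf', 'page_num', 'block_num', 'par_num' and 'line_num',
    as returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        word = str(word).strip()
        # skip structural rows (empty text) and low-confidence words
        if not word or float(data["conf"][i]) <= min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # keys sort in reading order: page, block, paragraph, line
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hypothetical sample in the shape described above
sample = {
    "text": ["", "Hello", "world", "x", "Second", "line"],
    "conf": [-1, 96, 91, 12, 88, 90],
    "page_num": [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2, 2],
}

print(words_to_lines(sample))  # ['Hello world', 'Second line']
```

The resulting list of lines can be written to a text file or passed to whatever PDF library you prefer, which gives you visible text without the scanned image.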

Answer 1 (2021-11-10 07:51:27)
