
Text recognition of an image with Tesseract

I would like to create a PDF file with recognized text from a scanned image.

But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read, but the font doesn't matter much.

This Tesseract command does almost what I want, but the text is invisible.

tesseract -c textonly_pdf=1 test.tif test pdf 
  • How can I make the text visible?
  • Can I create the PDF file with another command-line or Python tool?

I'm running Tesseract on Ubuntu.

Here is a snippet of code from a script I wrote in Python a year ago. It extracts the recognized text into a DataFrame, which you can then save to CSV or other formats.

import cv2
import pytesseract as pya
from pytesseract import Output

# Only needed when tesseract is not on PATH (this is a Windows install path)
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'

imgcv = cv2.imread('foo.jpg')
# text_df has one row per detected element: text, confidence, bounding box, ...
text_df = pya.image_to_data(imgcv, output_type=Output.DATAFRAME)
# Keep only words with confidence above 50 (this also drops the conf == -1 rows)
text_df = text_df[text_df.conf > 50]
# Mean confidence of the words that were kept
conf = text_df['conf'].mean()
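
To get from the recognized words to a PDF with visible text, one possible approach is to rebuild the text lines and write them into a small PDF yourself. The sketch below is not a Tesseract or pytesseract feature. `words_to_lines` and `lines_to_pdf_bytes` are hypothetical helper names, and the input is assumed to be the dict of lists that `image_to_data` returns with `output_type=Output.DICT`. It uses only the standard library, writes a single page in Helvetica, supports only characters that fit in cp1252, and does not preserve the original layout or wrap long lines.

```python
def words_to_lines(data, min_conf=50):
    """Group OCR words into text lines, keeping words with conf > min_conf."""
    lines = {}
    for i, word in enumerate(data["text"]):
        word = str(word).strip()
        # conf may arrive as int, float or string depending on the version
        if not word or float(data["conf"][i]) <= min_conf:
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]


def _pdf_escape(s):
    # Backslashes and parentheses must be escaped inside PDF string literals
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def lines_to_pdf_bytes(lines, font_size=11, page_w=595, page_h=842, margin=50):
    """Build a one-page PDF (A4 by default) with visible Helvetica text."""
    leading = font_size + 3
    ops = ["BT", f"/F1 {font_size} Tf", f"{leading} TL",
           f"{margin} {page_h - margin} Td"]
    for line in lines:
        ops.append(f"({_pdf_escape(line)}) Tj T*")  # show line, move down
    ops.append("ET")
    # Characters outside cp1252 are replaced with '?'
    stream = "\n".join(ops).encode("cp1252", "replace")
    page = (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {page_w} {page_h}] "
            "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        page.encode("ascii"),
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica "
        b"/Encoding /WinAnsiEncoding >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)


# Usage with the snippet above (assumes pytesseract and an image are available):
# data = pya.image_to_data(imgcv, output_type=Output.DICT)
# with open("test.pdf", "wb") as f:
#     f.write(lines_to_pdf_bytes(words_to_lines(data)))
```

For multi-page output, other fonts or non-Latin scripts, a dedicated PDF library is likely a better fit than this hand-written writer.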
