
Output of the text extracted from Tesseract OCR

I am using Google Tesseract to extract text from images. I have a folder containing some images, and I want to store the extracted text in a text file. The results are okay, but some red boxes show up in the output.txt file.

Here is my code for extracting text from a folder:

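import os

import cv2
import pytesseract as pt

# --oem 3: default OCR engine mode; --psm 6: assume a single uniform block of text
custom_config = "--oem 3 --psm 6"

path = "/home/rakshit/Documents/textextraction/croped/82092117"

textBox = []
for filename in os.listdir(path):
    # The part of the file name before the first underscore names the output file
    head = os.path.split(filename)
    file_name = head[1].split('_', 1)[0]

    imagePath = os.path.join(path, filename)
    img = cv2.imread(imagePath)
    # Pass the loaded image; the original call used an undefined name, `image`
    text = pt.image_to_string(img, config=custom_config)
    textBox.append(text)

# file_name still holds the prefix of the last file processed in the loop
finalPath = f"/home/rakshit/Documents/textextraction/outputText/detected/{file_name}.txt"

# Write the text of every image into one file, one entry per image
with open(finalPath, 'w') as f:
    for t in textBox:
        f.write(t)
        f.write("\n")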

The output text looks something like this: [image: screenshot of the output text file]

Can someone tell me what these boxes in the output text file are? Thanks in advance for any time you devote to this problem.

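Sharing a screenshot of a text file is not good practice; paste the text itself instead. But I guess the boxes are page breaks: Tesseract ends the text of each page with a form feed character (\f, i.e. \x0c), and some editors display that control character as a box.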
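If the boxes really are page breaks, one way to get rid of them is to remove the form feed character from each string before writing it to the file. The following is only a minimal sketch under that assumption: strip_page_breaks is a hypothetical helper name, and the sample strings stand in for what pt.image_to_string might return, so the snippet runs without Tesseract installed.

```python
FORM_FEED = "\x0c"  # the page-break control character, also written "\f"


def strip_page_breaks(text: str) -> str:
    """Remove form feed characters and trailing whitespace from OCR output."""
    return text.replace(FORM_FEED, "").rstrip()


# Hypothetical sample strings, standing in for pt.image_to_string results
sample_outputs = ["Invoice 82092117\n\x0c", "Total: 42.00\n\x0c"]

cleaned = [strip_page_breaks(t) for t in sample_outputs]
joined = "\n".join(cleaned)

# repr() makes any leftover control characters visible
print(repr(joined))
```

In the loop from the question, this would amount to appending strip_page_breaks(text) to textBox instead of text. Tesseract also has a page_separator config variable that controls this character, so it may be possible to turn it off through the config string if your version supports it; check that against your installation.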

