Output text extracted from Tesseract OCR

Question

I am using Google Tesseract to extract text from images. I have a folder containing some images, and I want to store the extracted text in a text file. The results are okay, but some red boxes show up in the output.txt file.

Here is my code for extracting text from a folder:

import cv2
import pytesseract as pt
import os

custom_config = "--oem 3 --psm 6"

path ="/home/rakshit/Documents/textextraction/croped/82092117"

textBox = []
for filename in os.listdir(path):
    head = os.path.split(filename)
    file_name = head[1].split('_',1)[0]

    imagePath = os.path.join(path, filename)
    img = cv2.imread(imagePath)
    # Pass the image that was just loaded (the original referenced an undefined name, `image`)
    text = pt.image_to_string(img, config=custom_config)
    textBox.append(text)

finalPath = f"/home/rakshit/Documents/textextraction/outputText/detected/{file_name}.txt"

with open(finalPath, 'w') as f:
    for t in textBox:
        f.write(t)
        f.write("\n")

The output text looks something like this: (screenshot of the output text file)

Can someone tell me what these boxes in the output text file are? Thanks in advance for any time you devote to this problem.

Answer 1

Sharing a screenshot of a text file is not good practice. But I guess the boxes are page breaks.
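
If the guess above is right, the boxes would be the form feed character (`\x0c`) that Tesseract typically appends as a page separator, which many editors render as a box. The following is only a minimal sketch under that assumption: `strip_page_breaks` is a hypothetical helper name, and the sample string is made up to stand in for OCR output.

```python
# Assumption: the "boxes" are form feed characters (page breaks) in the OCR text.
FORM_FEED = "\x0c"


def strip_page_breaks(text: str) -> str:
    """Remove form feed characters and any trailing whitespace."""
    return text.replace(FORM_FEED, "").rstrip()


# Made-up sample standing in for the string returned by the OCR call.
sample = "Invoice 82092117\nTotal: 42\n\x0c"
cleaned = strip_page_breaks(sample)
print(repr(cleaned))
```

In the question's loop this could be applied as `textBox.append(strip_page_breaks(text))` before the text is written to the file.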
