Output of the text extracted from Tesseract OCR
I am using Google Tesseract to extract text from images. I have a folder containing some images, and I want to store the extracted text as text files. The results are okay, but some red boxes are shown in the output.txt file.
Here is my code for extracting text from the images in a folder:
import os

import cv2
import pytesseract as pt

custom_config = "--oem 3 --psm 6"
path = "/home/rakshit/Documents/textextraction/croped/82092117"
textBox = []

for filename in os.listdir(path):
    # Use the part of the file name before the first underscore as the output name
    head = os.path.split(filename)
    file_name = head[1].split('_', 1)[0]

    imagePath = os.path.join(path, filename)
    img = cv2.imread(imagePath)

    # Pass the loaded image (img); the original passed an undefined name, image
    text = pt.image_to_string(img, config=custom_config)
    # Note: textBox is never cleared, so it keeps the text of earlier images too
    textBox.append(text)

    finalPath = f"/home/rakshit/Documents/textextraction/outputText/detected/{file_name}.txt"
    with open(finalPath, 'w') as f:
        for t in textBox:
            f.write(t)
            f.write("\n")
The output text looks something like this: [image: screenshot of the output text file]
Can someone tell me what these boxes in the output text file are? Thanks in advance for any time you devote to this problem.
Sharing a screenshot of a text file is not good practice. But I guess the boxes are the page break character.
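If the boxes are indeed page breaks, they would most likely be form feed characters (`\x0c`), which recent Tesseract versions append to the output as a page separator. The following is only a minimal sketch under that assumption; `strip_page_breaks` is a hypothetical helper name, not part of pytesseract, and the sample string stands in for real OCR output.

```python
def strip_page_breaks(text: str) -> str:
    """Remove form feed characters (assumed here to be the page-break boxes)."""
    return text.replace("\x0c", "")


# Stand-in for the string returned by image_to_string (assumption for illustration)
sample = "Invoice 82092117\nTotal: 42\n\x0c"
cleaned = strip_page_breaks(sample)
print(repr(cleaned))
```

In the loop from the question, this could be applied to `text` before it is appended to `textBox`. If the boxes remain afterwards, the cause is something other than the page separator.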