Group separate strings of an OCR result based on coordinates in the image
I use easyocr to read the key figures from an image (the display output of a measuring instrument). Because the characters in the picture have different proportions, some strings that are meant to form one unit, such as a value and its unit (e.g. "230 Volt"), are recognised as separate strings ("230", "Volt"). Another example is multiline strings, where each line is recognised as a separate string. To illustrate this I prepared a picture. It's a little exaggerated, but I hope it's easy to understand.
Example picture to illustrate the problem
I am trying to find the elements that are on the same line or in the same column (and very close to each other) and concatenate their strings.
(The coordinates of each box run clockwise from the top-left corner to the bottom-left corner.)
([[239, 31], [563, 31], [563, 195], [239, 195]], '230', 0.7262734770774841)
([[591, 147], [661, 147], [661, 183], [591, 183]], 'Volt', 0.983400155647826)
([[801, 171], [1039, 171], [1039, 239], [801, 239]], 'This is a', 0.9870205241250117)
([[802, 256], [1232, 256], [1232, 328], [802, 328]], 'sentence with', 0.9997852752308181)
([[805, 341], [1065, 341], [1065, 427], [805, 427]], 'multiple', 0.9999849956753041)
([[212, 427], [311, 427], [311, 479], [212, 479]], 'Text', 0.9999873638153076)
([[362, 428], [474, 428], [474, 476], [362, 476]], 'More', 0.9999922513961792)
([[505, 413], [643, 413], [643, 479], [505, 479]], 'Text', 0.9999755620956421)
([[798, 428], [1136, 428], [1136, 500], [798, 500]], 'linebreaks.', 0.8525006562415545)
([[317, 601], [479, 601], [479, 669], [317, 669]], 'More', 0.9999911785125732)
([[529, 603], [665, 603], [665, 669], [529, 669]], 'Text', 0.9757571413464591)
([[699, 603], [841, 603], [841, 669], [699, 669]], 'with', 0.9999924302101135)
([[950, 608], [1182, 608], [1182, 683], [950, 683]], 'spaces.', 0.8026406194725301)
I tried handling it as a DataFrame and splitting the values into x and y for each point. I thought this view would help me, but I am still stuck:
             Text     Score  tl_x  tl_y  tr_x  tr_y  bl_x  bl_y  br_x  br_y
0             230  0.726273   239    31   563    31   239   195   563   195
1            Volt  0.983400   591   147   661   147   591   183   661   183
2       This is a  0.987021   801   171  1039   171   801   239  1039   239
3   sentence with  0.999785   802   256  1232   256   802   328  1232   328
4        multiple  0.999985   805   341  1065   341   805   427  1065   427
5            Text  0.999987   212   427   311   427   212   479   311   479
6            More  0.999992   362   428   474   428   362   476   474   476
7            Text  0.999976   505   413   643   413   505   479   643   479
8     linebreaks.  0.852501   798   428  1136   428   798   500  1136   500
9            More  0.999991   317   601   479   601   317   669   479   669
10           Text  0.975757   529   603   665   603   529   669   665   669
11           with  0.999992   699   603   841   603   699   669   841   669
12        spaces.  0.802641   950   608  1182   608   950   683  1182   683
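For reference, a frame with these columns can be built directly from the easyocr tuples. This is only a minimal sketch: the `results` list below repeats the first two detections from the question, and the names `results`, `rows` and `df` are assumptions chosen for illustration. It relies on the corner order stated above (top-left, top-right, bottom-right, bottom-left).

```python
import pandas as pd

# First two detections from the question, in easyocr's (corners, text, confidence) layout.
results = [
    ([[239, 31], [563, 31], [563, 195], [239, 195]], '230', 0.7262734770774841),
    ([[591, 147], [661, 147], [661, 183], [591, 183]], 'Volt', 0.983400155647826),
]

rows = []
for (tl, tr, br, bl), text, score in results:
    # Corners arrive clockwise: top-left, top-right, bottom-right, bottom-left.
    rows.append({
        "Text": text, "Score": score,
        "tl_x": tl[0], "tl_y": tl[1],
        "tr_x": tr[0], "tr_y": tr[1],
        "bl_x": bl[0], "bl_y": bl[1],
        "br_x": br[0], "br_y": br[1],
    })

df = pd.DataFrame(rows)
print(df)
```

Since the boxes here are axis-aligned, `tl_x`, `tl_y`, `br_x` and `br_y` alone already describe each box; the other four columns are redundant for grouping.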
I would be happy with a list of concatenated strings:
["230 Volt", "This is a sentence with multiple linebreaks.","More text with spaces", ...]
I am sure there is a very simple solution to this; I am just not good enough at programming yet to see it.
What would be the best approach to find the closest neighbours (on the same line/column) and group them together?
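One possible way to frame the neighbour search, as a rough sketch rather than a definitive solution: reduce each detection to an axis-aligned box, call two boxes neighbours when they share a line (vertical overlap) and the horizontal gap is small, or share a column (horizontal overlap) and the vertical gap is small, then merge neighbours into connected groups. The helper names (`to_box`, `are_neighbours`, `group_boxes`, `reading_order`) and the thresholds `x_factor=2.0` and `y_factor=0.5` (gaps measured relative to the smaller box height) are assumptions tuned by hand to the sample data above, not values from easyocr.

```python
from itertools import combinations

# Detections from the question: (corner points, text, confidence).
results = [
    ([[239, 31], [563, 31], [563, 195], [239, 195]], '230', 0.7262734770774841),
    ([[591, 147], [661, 147], [661, 183], [591, 183]], 'Volt', 0.983400155647826),
    ([[801, 171], [1039, 171], [1039, 239], [801, 239]], 'This is a', 0.9870205241250117),
    ([[802, 256], [1232, 256], [1232, 328], [802, 328]], 'sentence with', 0.9997852752308181),
    ([[805, 341], [1065, 341], [1065, 427], [805, 427]], 'multiple', 0.9999849956753041),
    ([[212, 427], [311, 427], [311, 479], [212, 479]], 'Text', 0.9999873638153076),
    ([[362, 428], [474, 428], [474, 476], [362, 476]], 'More', 0.9999922513961792),
    ([[505, 413], [643, 413], [643, 479], [505, 479]], 'Text', 0.9999755620956421),
    ([[798, 428], [1136, 428], [1136, 500], [798, 500]], 'linebreaks.', 0.8525006562415545),
    ([[317, 601], [479, 601], [479, 669], [317, 669]], 'More', 0.9999911785125732),
    ([[529, 603], [665, 603], [665, 669], [529, 669]], 'Text', 0.9757571413464591),
    ([[699, 603], [841, 603], [841, 669], [699, 669]], 'with', 0.9999924302101135),
    ([[950, 608], [1182, 608], [1182, 683], [950, 683]], 'spaces.', 0.8026406194725301),
]

def to_box(points):
    """Reduce four corner points to (x0, y0, x1, y1)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def are_neighbours(a, b, x_factor=2.0, y_factor=0.5):
    """True if a and b sit close together on one line or in one column."""
    gap_x = max(a[0], b[0]) - min(a[2], b[2])  # negative: horizontal overlap
    gap_y = max(a[1], b[1]) - min(a[3], b[3])  # negative: vertical overlap
    height = min(a[3] - a[1], b[3] - b[1])     # smaller box sets the scale
    same_line = gap_y < 0 and gap_x <= x_factor * height
    same_column = gap_x < 0 and gap_y <= y_factor * height
    return same_line or same_column

def group_boxes(boxes):
    """Merge neighbouring boxes into connected groups (union-find)."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if are_neighbours(boxes[i], boxes[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def reading_order(indices, boxes):
    """Order one group line by line, left to right within each line."""
    lines, bottom = [], None
    for i in sorted(indices, key=lambda i: boxes[i][1] + boxes[i][3]):
        centre_y = (boxes[i][1] + boxes[i][3]) / 2
        if bottom is None or centre_y > bottom:
            lines.append([i])            # centre lies below the line: new line
            bottom = boxes[i][3]
        else:
            lines[-1].append(i)
            bottom = max(bottom, boxes[i][3])
    return [i for line in lines for i in sorted(line, key=lambda i: boxes[i][0])]

boxes = [to_box(points) for points, _text, _score in results]
# Report groups top to bottom, then left to right.
groups = sorted(group_boxes(boxes), key=lambda g: min((boxes[i][1], boxes[i][0]) for i in g))
merged = [" ".join(results[i][1] for i in reading_order(g, boxes)) for g in groups]
print(merged)
```

Scaling the gaps by box height instead of using fixed pixel distances keeps the large "230" from pulling in "This is a", while still joining "with" and "spaces." despite the wide space between them. The thresholds would most likely need adjusting for other images.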
Will the text always follow a similar pattern? In the case above you get "230 volts", and in another example you might get "320 volts", so the value would always be formatted as "x volts"?
If so:
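import pandas as pd

# df is the DataFrame from the question; its "Text" column holds the recognised strings.
c1_str = ' '.join(df["Text"])
# Mark the end of each group with a '|' delimiter after a known closing word.
c1_str = c1_str.replace('linebreaks', 'linebreaks|')
c1_str = c1_str.replace('Volt', 'Volt|')
# Split on the delimiter to get one string per group.
mask = c1_str.split('|')
print(mask)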
Gives:
['230 Volt', ' This is a sentence with multiple linebreaks', '. More text with spaces']
Convert the df column to a string and manipulate it according to the pattern you want by inserting split markers, then convert the string to a list by splitting on those markers.
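The same idea can also be written as a single regular expression instead of chained replace calls. This is a minimal sketch, assuming the strings are joined in the order easyocr returned them and that every group ends either with the unit "Volt" or with a full stop; the `texts` list and the `pattern` are illustrative assumptions, not part of the code above.

```python
import re

# Recognised strings in the order they appear in the question's OCR output.
texts = ['230', 'Volt', 'This is a', 'sentence with', 'multiple', 'Text',
         'More', 'Text', 'linebreaks.', 'More', 'Text', 'with', 'spaces.']

joined = " ".join(texts)

# Split on whitespace that directly follows "Volt" or a full stop.
pattern = r"(?<=Volt)\s+|(?<=\.)\s+"
parts = re.split(pattern, joined)
print(parts)
# ['230 Volt', 'This is a sentence with multiple Text More Text linebreaks.', 'More Text with spaces.']
```

With the full OCR output, the separate "Text More Text" row ends up inside the sentence because those boxes come between "multiple" and "linebreaks." in detection order. Splitting on text patterns alone cannot undo that; it only works when the detection order already matches the intended grouping.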