Improve tesseract results (pytesseract)

I have been downloading tiles from a TMS-like server displaying some "google tiles" with geodata. The data are French townships, drawn in particular colors according to the map's legend.

I have written an algorithm which mainly uses PIL to process the tiles (as pictures) before presenting them to tesseract (using pytesseract). In the end, knowing the tile's position (and therefore knowing which townships may be in the area), I use fuzzywuzzy's process.extractBests to try to identify which township has been found.

So far, some pictures for which tesseract returns no result look relatively fine to me (they are not perfect, but you can distinctly read the French name "Sainte Honorine de Ducy"):

[Image: an example of a tile that tesseract cannot read correctly]

I should point out that in this case the original tile is around 1500x3000 pixels (I have already enlarged the tile).

I have also modified pytesseract to pass the 'bazaar' keyword which was mentioned in the doc, plus custom 'user-words' containing townships from the area. That being said, I could only find the "bazaar" reference in the tesseract 1 documentation, with nothing better than a link in the newest documentation. In fact, I seem to remember a post somewhere saying it was a mistake in the documentation... For what it's worth, it doesn't seem to change the results here.

Do you have any suggestions? In particular, do you think the picture's quality is good enough to expect solid results?

I know almost nothing about training tesseract myself on this particular font. Considering that (and that I don't manage the data source, and don't even know what font is used...), I hope that you may have better suggestions than taking this (huge) leap...

PS: I know I maybe shouldn't have posted this question without any code, but I'm more in need of general guidance here... I will post any required code anyway!

I think the problem is that the text is too small compared to the image size.

You should apply some more image transformations to find a more exact area where the text is located: try something like morphological transformations and then find the contours of the area containing the text. Also take a look at this tutorial, which uses OpenCV.

I tried cropping the image with GIMP and then resizing it to make it a little bigger:

The result with pytesseract is:

Saiptnmnorine-de-Ducy

That is acceptable: with some further processing with fuzzywuzzy you could get the right name.
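To illustrate that last step, here is a minimal sketch that matches the OCR output against a list of candidate township names. It uses `difflib` from the standard library as a stand-in for fuzzywuzzy, so the scores will not be the same as fuzzywuzzy's. The helper `best_match` and the candidate list are hypothetical; in the question's setup the candidates would come from the tile's position.

```python
from difflib import SequenceMatcher


def best_match(ocr_text, candidates):
    """Return (candidate, score) for the candidate most similar to ocr_text."""

    def score(name):
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, ocr_text.lower(), name.lower()).ratio()

    best = max(candidates, key=score)
    return best, score(best)


# Illustrative candidates only, not taken from the actual tile data.
candidates = ["Sainte-Honorine-de-Ducy", "Saint-Paul-du-Vernay", "Balleroy"]
name, score = best_match("Saiptnmnorine-de-Ducy", candidates)
print(name, round(score, 2))
```

With fuzzywuzzy installed, `process.extractBests` can be given the same OCR string and candidate list.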
