
Recognize MICR font using OCR Engine?

Original post 2016-08-08 08:17:34 8 2 windows-runtime/ windows-phone/ ocr/ microsoft-ocr

I am using the Microsoft OCR Library to read text.

The Microsoft OCR library works perfectly. However, I want to read the characters shown in the image at http://www.ict4u.net/databases/database-images/micr.jpg . Is there a way to train the OCR library to read these characters, or is there a language pack that allows them to be read?

2 answers

[Microsoft OCR crew here] We don't yet support training OCR to customize it for your use cases. However, we do actively keep an eye on Stack Overflow to see what developers need, so we can keep improving the OCR engine.

I have been working with Microsoft OCR for a while now. Compared with Tesseract, it has very basic functionality.

For example, Microsoft OCR returns words and lines, but the lines are nonsense. Two or three words are randomly grouped together as a "line" even though they do not form a real line of text, and the "lines" are completely unordered. In this respect it is worse than Tesseract. You have to take the coordinates of each word and order them on your own.
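
The manual ordering described above can be sketched roughly as follows. This is a minimal, library-independent sketch in Python, not the Microsoft OCR API: the `Word` class, the `group_into_lines` and `lines_to_text` helpers, and the `tolerance` parameter are all hypothetical names chosen for illustration, and it assumes each recognized word comes with an axis-aligned bounding box and that the text is not rotated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one recognized word and its bounding box."""
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    width: float
    height: float

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def group_into_lines(words: List[Word], tolerance: float = 0.5) -> List[List[Word]]:
    """Group words into lines (top to bottom), each line sorted left to right.

    A word joins the current line when its vertical center lies within
    `tolerance` times the line height of that line's average center.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.center_y):
        if lines:
            current = lines[-1]
            line_center = sum(w.center_y for w in current) / len(current)
            line_height = max(w.height for w in current)
            if abs(word.center_y - line_center) <= tolerance * line_height:
                current.append(word)
                continue
        lines.append([word])
    return [sorted(line, key=lambda w: w.x) for line in lines]


def lines_to_text(words: List[Word]) -> str:
    """Join the reordered words into plain text, one real line per row."""
    return "\n".join(" ".join(w.text for w in line) for line in group_into_lines(words))
```

The tolerance of half a line height is a guess; skewed scans or mixed font sizes would probably need a different threshold or a deskewing step first.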

Microsoft OCR does not return the bounding rectangles of individual characters, and there is no way to configure or train it. You can add languages through Windows Update by installing "Basic Typing" (which includes OCR) for a language (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10 ), but you cannot train your own language data.

MSDN says that the following 25 languages are supported, with varying accuracy:

Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
Very good: Chinese Simplified, Greek, Japanese, Russian and Turkish.
Good: Chinese Traditional and Korean.

The recognition quality is very similar to Tesseract's. It even has exactly the same problems: some isolated characters are not recognized (standalone symbols such as a single '$'), and it has the same huge problem with asterisks. It also inserts spaces in the wrong places, as Tesseract does. So I ask myself whether Microsoft is using Tesseract under the hood.

However, Microsoft OCR has one advantage over Tesseract: the image preprocessing is much better. It does not matter whether you have red text on a yellow background or white text on black. This is a stumbling block for Tesseract, which needs a good-quality black-and-white image as input.

The same applies to both OCR libraries: if you have recognition problems, try enlarging the image. Even blurring the image may be very helpful, because this removes noise from the image.
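
To illustrate the two preprocessing steps mentioned above, here is a minimal pure-Python sketch. It assumes a grayscale image stored as a list of pixel rows; the `upscale` and `box_blur` helpers are hypothetical names for illustration and are not part of Microsoft OCR or Tesseract. In practice an image library would normally do this work, typically with better interpolation than the nearest-neighbour sampling used here.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0-255, stored row by row


def upscale(image: Image, factor: int) -> Image:
    """Enlarge the image by an integer factor using nearest-neighbour sampling."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out: Image = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def box_blur(image: Image) -> Image:
    """Apply a 3x3 mean filter; border pixels average over the neighbours that exist."""
    height = len(image)
    width = len(image[0]) if height else 0
    out: Image = []
    for y in range(height):
        new_row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            new_row.append(sum(neighbours) // len(neighbours))
        out.append(new_row)
    return out
```

A single bright speck is spread out and dimmed by the blur, which is the noise-reducing effect described above; whether upscaling before or after blurring works better for a given image is something to try out rather than assume.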