
Image to text: remove non-ASCII chars in Python 2.7

I am using pytesser to OCR a small image and get a string from it:

image = Image.open(ImagePath)
text = image_to_string(image)
print text

However, pytesser sometimes recognizes and returns non-ASCII characters. The problem occurs when I try to print what was just recognized: in Python 2.7 (which is what I am using), the program crashes.

Is there some way to make pytesser not return any non-ASCII characters? Perhaps there is a setting that can be changed in Tesseract OCR?

Or is there some way to test a string for non-ASCII characters (without crashing the program) and then simply not print that line?

Some would suggest using Python 3.4, but from my research it looks like pytesser does not work with it: Pytesser in Python 3.4: name 'image_to_string' is not defined?

I would go with Unidecode. This library converts non-ASCII characters to their most similar ASCII representation.

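from unidecode import unidecode
image = Image.open(ImagePath)
text = image_to_string(image)
# unidecode expects a unicode object; this assumes the OCR output is UTF-8 encoded bytes
print unidecode(text.decode('utf-8'))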

It should work perfectly!
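If installing Unidecode is not an option, the other idea from the question (test a string for non-ASCII characters and skip or clean it) can be sketched with the standard library alone. The helper names `is_ascii` and `strip_non_ascii` and the sample strings are made up for illustration, and the sketch assumes the text has already been decoded to a unicode string:

```python
from __future__ import print_function  # lets the same code run on Python 2.7 and 3


def is_ascii(text):
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in text)


def strip_non_ascii(text):
    """Drop every character whose code point is 128 or above."""
    return "".join(ch for ch in text if ord(ch) < 128)


# Hypothetical OCR results standing in for image_to_string() output.
lines = [u"plain text", u"caf\u00e9 menu"]

for line in lines:
    if is_ascii(line):
        print(line)
    else:
        # Alternatively, skip the line entirely instead of cleaning it.
        print(strip_non_ascii(line))
```

Unlike Unidecode, this simply discards the offending characters rather than replacing them with a similar-looking ASCII letter, so some information is lost.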

Is there some way to make it so pytesser does not return any non-ascii characters?

You could limit the characters that Tesseract is allowed to recognize by using the option tessedit_char_whitelist.

For instance:

import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text

See also: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
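As a safety net, in case the installed engine or wrapper does not honour the whitelist option, the same character set could also be enforced on the returned string in Python. This is only a sketch: `ALLOWED` and `keep_allowed` are hypothetical names, and keeping spaces and newlines is an assumption about what the output should preserve.

```python
import string

# Same characters as the whitelist above, plus space and newline so that
# words and lines stay separated (an assumption, adjust as needed).
ALLOWED = set(string.digits + string.ascii_letters + "_-." + " \n")


def keep_allowed(text, allowed=ALLOWED):
    """Return text with every character outside `allowed` removed."""
    return "".join(ch for ch in text if ch in allowed)


# Hypothetical OCR result standing in for image_to_string() output.
sample = u"abc\u00e9 1-2.\n"
cleaned = keep_allowed(sample)
```

Filtering after the fact cannot recover characters that were misrecognized, so it complements the whitelist rather than replacing it.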
