UnicodeDecodeError with Tesseract OCR in Python

I am trying to extract text from an image file using Tesseract OCR in Python, but I am getting an error that I can't figure out how to deal with. My environment is set up correctly, as I have successfully tested some sample images with the OCR in Python.

Here is the code:

from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))

print(strs)

The following is the error I get from the Eclipse console:

strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
  File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am using Python 3.5 x64 on Windows 10.

The problem is that Python is reading Tesseract's output file with the Windows default encoding (CP1252) instead of the encoding the file is actually written in (UTF-8). The OCR result contains a non-ASCII character whose UTF-8 bytes include 0x9d, and that byte has no mapping in CP1252, so decoding fails. On a platform where the default encoding is UTF-8 you won't encounter this error.
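The failure can be reproduced without Tesseract or an image. The sketch below assumes the OCR output contained a right double quotation mark (U+201D); that particular character is only a guess, chosen because its UTF-8 encoding ends in the byte 0x9d seen in the traceback. The helper name `try_decode` is made up for this illustration.

```python
# Assumption: the OCR text contained U+201D (right double quotation mark).
# Its UTF-8 encoding is the three bytes e2 80 9d.
text = "\u201d"
utf8_bytes = text.encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded string, or a short message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + str(exc)


# Decoding the UTF-8 bytes as CP1252 fails because 0x9d is undefined there.
cp1252_result = try_decode(utf8_bytes, "cp1252")
# Decoding them as UTF-8 gives back the original character.
utf8_result = try_decode(utf8_bytes, "utf-8")

print(utf8_bytes)
print(cp1252_result)
print(ascii(utf8_result))
```

The CP1252 attempt reports the same 'charmap' error as the traceback in the question, while the UTF-8 attempt succeeds.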

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could download Cygwin and run Python there to get clean UTF-8 output.

If you want a quick and dirty solution that won't break anything (yet), here's an approach you might consider:

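import builtins

original_open = open
def bin_open(filename, mode='rb'):       # note, the default mode now opens in binary
    return original_open(filename, mode)

from PIL import Image
import pytesseract

img = Image.open('binarized_image.png')

try:
    # Temporarily replace open() so pytesseract reads its output file as bytes
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if the OCR call fails
    builtins.open = original_open

# Tesseract writes its output as UTF-8, so decode the bytes as UTF-8
print(str(bts, 'utf-8', 'ignore'))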
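The "bytes instead of str" idea can also be shown in isolation, without patching `builtins`. This is only a sketch of the principle: the helper `read_text_as_utf8` and the temporary file standing in for Tesseract's output file are assumptions for illustration, not part of pytesseract.

```python
import os
import tempfile


def read_text_as_utf8(path):
    """Read a file as raw bytes and decode it explicitly as UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8").strip()


# Stand-in for an OCR output file that contains non-ASCII characters.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("\u201cquoted\u201d\n".encode("utf-8"))
    tmp_path = tmp.name

try:
    result = read_text_as_utf8(tmp_path)
finally:
    # Clean up the temporary file whether or not reading succeeded.
    os.remove(tmp_path)

print(ascii(result))
```

Because the file is opened in binary mode and decoded by hand, the platform's default encoding never comes into play.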

I had the same problem, but I also had to save the output of pytesseract to a file. So I created a function for OCR with pytesseract and added the parameter encoding='utf-8' when saving to the file. My function now looks like this:

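import pytesseract

def image_ocr(image_path, output_txt_file_name):
    # Run OCR with the English and Czech language data
    image_text = pytesseract.image_to_string(image_path, lang='eng+ces', config='--psm 1')
    # Write the result as UTF-8 so non-ASCII characters are saved correctly
    with open(output_txt_file_name, 'w+', encoding='utf-8') as f:
        f.write(image_text)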
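To see why the explicit encoding matters when writing, here is a small sketch that needs neither pytesseract nor an image. The sample sentence, the file name `ocr_output.txt`, and the helper `write_text` are made up for illustration; the sample simply stands in for Czech OCR output.

```python
import os
import tempfile

# Stand-in for Czech OCR output; letters such as "ř" do not exist in CP1252.
sample = "Příliš žluťoučký kůň"


def write_text(path, text, encoding):
    """Write text with the given encoding; return True if it succeeded."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return True
    except UnicodeEncodeError:
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ocr_output.txt")
    # CP1252 cannot represent every character in the sample.
    cp1252_ok = write_text(path, sample, "cp1252")
    # UTF-8 can represent any Unicode text.
    utf8_ok = write_text(path, sample, "utf-8")
    # Read back with the same explicit encoding to verify the round trip.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(cp1252_ok, utf8_ok, round_trip == sample)
```

Passing the same `encoding='utf-8'` when reading the file back keeps the round trip lossless on any platform.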

I hope this helps someone :)


 