
How do I print a list of strings, when I can't know the char encoding in advance?

I am retrieving a list of names from a web service using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: 
                    ordinal not in range(128)

Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?

The UnicodeDammit class from BeautifulSoup can try to detect the encoding automagically.

from BeautifulSoup import UnicodeDammit

u = UnicodeDammit("Ólafur Jóhann Ólafsson")

print u.unicode
print u.originalEncoding
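If pulling in BeautifulSoup is not an option, a cruder fallback is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. The sketch below is only an illustration and is written for Python 3, unlike the Python 2 snippets in this thread; the function name decode_best_effort and the candidate list are assumptions, not part of any library. It guesses, it does not detect: bytes that happen to be valid in an earlier candidate will be decoded with it even if that is wrong.

```python
def decode_best_effort(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate order is a guess, not a detection.
    latin-1 maps every byte to a character, so it acts as a last resort.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
    raise ValueError("none of the candidate encodings could decode the data")


# The same name, as a web service might send it in two different encodings.
utf8_bytes = "Ólafur Jóhann Ólafsson".encode("utf-8")
latin1_bytes = "Ólafur Jóhann Ólafsson".encode("latin-1")

text_a, enc_a = decode_best_effort(utf8_bytes)
text_b, enc_b = decode_best_effort(latin1_bytes)
```

Putting utf-8 first is deliberate: it is strict, so non-UTF-8 data usually fails there and falls through to the looser single-byte encodings.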

This page may help you: http://wiki.python.org/moin/PrintFails

The problem, I guess, is that you need to print those names to the console. Do you really need that, or is it just a test environment? If you use the console only for testing, you could switch to other tools, such as unit tests, to check exactly what values you get.

First of all, you decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc., and encode Unicode to an appropriate byte encoding when sending/persisting data. I suspect this is the root of your problem.
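The decode-on-input, encode-on-output rule can be sketched roughly as follows. This is a Python 3 illustration, not the asker's client: the function names, the newline-separated response format, and the encodings passed in are all assumptions made for the example.

```python
def names_from_response(raw_body, declared_encoding):
    """Decode once, at the input boundary: bytes -> text."""
    text = raw_body.decode(declared_encoding)
    return [name for name in text.splitlines() if name]


def names_to_bytes(names, target_encoding):
    """Encode once, at the output boundary: text -> bytes."""
    return "\n".join(names).encode(target_encoding)


# Pretend the service sent UTF-8 and said so in its headers.
raw = "Ólafur Jóhann Ólafsson\nJón\n".encode("utf-8")
names = names_from_response(raw, "utf-8")

# Everything in between works on text; bytes reappear only on output.
out = names_to_bytes(names, "latin-1")
```

Between the two boundaries the program only ever handles text, so no implicit conversions can sneak in.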

The web service should declare the encoding in the headers or data received. print normally encodes Unicode automatically to the terminal's encoding (discovered through sys.stdout.encoding) or, in the absence of that, just ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
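To make the UnicodeEncodeError case concrete, here is a small Python 3 sketch that encodes text for a terminal and falls back to escape sequences when the target encoding cannot represent a character. The helper name encode_for_terminal is hypothetical; the backslashreplace error handler is a standard one from Python's codecs machinery.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode text for output, escaping characters the encoding lacks.

    Hypothetical helper. Falls back to 'ascii' when the stream reports
    no encoding, mirroring the behaviour described above.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # Lossy but safe: unsupported characters become \xNN or \uNNNN escapes.
        return text.encode(encoding, errors="backslashreplace")


as_ascii = encode_for_terminal("Ólafur", "ascii")
as_utf8 = encode_for_terminal("Ólafur", "utf-8")
```

Escaping keeps the output readable enough for debugging without crashing on a terminal that cannot show the character.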

Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding it. Here's an example of this:

>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437. That implicit decode is what raises the UnicodeDecodeError.
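Python 3 no longer performs this implicit step (bytes objects have no encode method), but the two stages can be spelled out by hand to see where the error comes from. The function py2_style_encode below is a hypothetical stand-in that imitates what Python 2 does behind the scenes; it is not a real API.

```python
def py2_style_encode(byte_string, target):
    """Imitate Python 2's str.encode on a byte string (illustration only)."""
    # Step 1, implicit in Python 2: bytes -> text using the default 'ascii' codec.
    text = byte_string.decode("ascii")
    # Step 2, the part that was actually asked for: text -> bytes.
    return text.encode(target)


data = b"\xc2\xbd"  # UTF-8 encoded 1/2 symbol, as in the example above

error = None
try:
    py2_style_encode(data, "cp437")
except UnicodeDecodeError as exc:
    error = exc  # raised by step 1, before cp437 is ever involved

# Pure ASCII input passes through both steps unnoticed.
plain = py2_style_encode(b"abc", "cp437")
```

The last line hints at why the bug tends to stay hidden until a non-ASCII name such as "Ólafur" shows up in the data.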

Fix this by decoding the data instead of encoding it; print will then encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:

>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
