Python: Converting from ISO-8859-1/latin1 to UTF-8

I have a string that the email module decoded from quoted-printable, leaving it ISO-8859-1 encoded. This gives me strings like "\xC4pple", which corresponds to "Äpple" ("apple" in Swedish). However, I can't convert those strings to UTF-8.

>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What should I do?

This is a common problem, so here's a relatively thorough illustration.

For non-unicode strings (i.e. those without the u prefix, so '\xc4pple' rather than u'\xc4pple'), one must first decode from the encoding the bytes are actually in (here iso8859-1 / latin1) to unicode, then encode to a character set that can represent the characters you wish; in this case I'd recommend UTF-8. Calling encode directly on a str makes Python 2 decode it implicitly with the default encoding first, which is ascii unless modified with the enigmatic sys.setdefaultencoding function, and that implicit step is what raises the UnicodeDecodeError shown in the question.

First, here is a handy utility function that'll help illuminate the patterns of Python 2.7 string and unicode:

>>> def tell_me_about(s): return (type(s), s)

A plain string

>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # the raw iso-8859-1 byte is written to the terminal unchanged
                  # note that the lone byte '\xc4' is not valid utf-8, so a
                  # utf-8 terminal prints it as "?".

Decoding an iso8859-1 string - convert plain string to unicode

>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'       # decoding iso-8859-1 becomes unicode, in memory

>>> tell_me_about(uv)
(<type 'unicode'>, u'\xc4pple')

>>> print v.decode("iso-8859-1")
Äpple             # convert unicode to the default character set
                  # (utf-8, based on sys.stdout.encoding)

>>> v.decode('iso-8859-1') == u'\xc4pple'
True              # one could have just used a unicode representation 
                  # from the start

A little more illustration, with "Ä"

>>> u"Ä" == u"\xc4"
True              # the native unicode char and escaped versions are the same

>>> "Ä" == u"\xc4"  
False             # typed in a utf-8 terminal, the str "Ä" is the bytes '\xc3\x84'

>>> "Ä".decode('utf8') == u"\xc4"
True              # one can decode the string to get unicode

>>> "Ä" == "\xc4"
False             # the native character and the escaped string are
                  # of course not equal ('\xc3\x84' != '\xc4').
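For comparison, here is a minimal sketch of the same checks in Python 3, where every str literal is already unicode and byte strings need a b prefix. The variable names are illustrative only and do not appear in the transcript above.

```python
# Python 3: str is unicode text, bytes is raw encoded data.
# The character is written as an escape so the source file's own
# encoding does not affect the result.
a_umlaut = "\u00c4"  # the character "Ä"

# Both escape forms denote the same code point.
same_char = (a_umlaut == "\xc4")

# The same character has a different byte representation per encoding.
as_utf8 = a_umlaut.encode("utf-8")      # two bytes
as_latin1 = a_umlaut.encode("latin-1")  # one byte

# Decoding each byte string with its own encoding recovers the character.
round_trip_ok = (
    as_utf8.decode("utf-8") == as_latin1.decode("latin-1") == a_umlaut
)
```

The point carried over from the Python 2 session is that '\xc3\x84' and '\xc4' are two encodings of one character, not two different characters.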

Encoding to UTF

>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')
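The leading '\xff\xfe' in the UTF-16 output is a byte order mark (BOM). The following Python 3 sketch makes that explicit; it assumes little-endian byte order to match the transcript, and the variable names are illustrative.

```python
import codecs

text = "\u00c4pple"  # "Äpple"

# "utf-16-le" fixes the byte order and writes no byte order mark.
le_bytes = text.encode("utf-16-le")

# Prepending the little-endian BOM yields the byte string shown in the
# transcript above for the plain "utf-16" codec.
with_bom = codecs.BOM_UTF16_LE + le_bytes

# The plain "utf-16" decoder reads the BOM to pick the byte order.
decoded = with_bom.decode("utf-16")
```

Naming the byte order explicitly keeps the output identical across machines, whereas the plain "utf-16" encoder uses the platform's native order.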

Relationship between unicode and UTF and latin1

>>> print u8
Äpple             # printing utf-8 - because of the encoding we now know
                  # how to print the characters

>>> print u8.decode('utf-8') # printing unicode
Äpple

>>> print u16     # printing 'bytes' of u16
���pple

>>> print u16.decode('utf16')
Äpple             # printing unicode

>>> v == u8
False             # v is an iso8859-1 string; u8 is a utf-8 string

>>> v.decode('iso8859-1') == u8
False             # v.decode(...) returns unicode

>>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
True              # all decode to the same unicode memory representation
                  # (latin1 is iso-8859-1)

Unicode Exceptions

>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
  ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
  ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
  ordinal not in range(128)

All three calls fail because encode on a str first decodes it implicitly as ascii. One would get around this by decoding explicitly from the specific encoding (latin-1, utf8, utf16) to unicode before encoding, e.g. u8.decode('utf8').encode('latin1').
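As a rough Python 3 counterpart, the sketch below shows which errors remain once bytes and text are separate types. It assumes Python 3 semantics throughout, and the variable names are illustrative.

```python
latin1_bytes = b"\xc4pple"  # "Äpple" encoded as iso-8859-1

# bytes objects have no encode method in Python 3, so encoding an
# already encoded string is no longer possible by accident.
bytes_can_encode = hasattr(latin1_bytes, "encode")

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    wrong_decode_error = None
except UnicodeDecodeError as exc:
    wrong_decode_error = exc

# Encoding text with a codec that lacks the character raises
# UnicodeEncodeError.
try:
    "\u00c4pple".encode("ascii")
    ascii_encode_error = None
except UnicodeEncodeError as exc:
    ascii_encode_error = exc

# The working path: decode with the real source encoding, then encode.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```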

So perhaps one could draw the following principles and generalizations:

  • a type str is a sequence of bytes, which may be in one of a number of encodings such as Latin-1, UTF-8, and UTF-16
  • a type unicode is a sequence of characters (code points) that can be encoded to any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1)
  • the print statement has its own logic for encoding: it encodes unicode using sys.stdout.encoding, which is typically taken from the terminal's locale (UTF-8 in the examples here)
  • One must decode a str to unicode before converting to another encoding.
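The last principle can be sketched as a small helper. This is an illustration in Python 3 terms rather than code from the session above; the function name transcode and its parameters are hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes from source_encoding to target_encoding via text.

    Hypothetical helper for illustration: it refuses anything that is
    not encoded bytes, so the decode step can never be skipped.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError(
            "transcode expects encoded bytes, not %s" % type(data).__name__
        )
    text = bytes(data).decode(source_encoding)  # bytes -> unicode text
    return text.encode(target_encoding)         # unicode text -> bytes


latin1_apple = b"\xc4pple"
utf8_apple = transcode(latin1_apple, "iso-8859-1", "utf-8")
back_again = transcode(utf8_apple, "utf-8", "iso-8859-1")
```

Going through text in the middle is the design choice that matters here: the two encodings never interact directly, so any pair of codecs can be combined.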

Of course, all of this changes in Python 3.x.

Hope that is illuminating.

Further reading

And the very illustrative rants by Armin Ronacher:

Try decoding it first, then encoding:

apple.decode('iso-8859-1').encode('utf8')

For Python 3:

bytes(apple,'iso-8859-1').decode('utf-8')

I used this for text that was UTF-8 but had been incorrectly decoded as iso-8859-1 (showing words like VeÅ\x99ejné). This code produces the correct version, Veřejné.
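A minimal Python 3 sketch of that repair follows. The helper name fix_mojibake is hypothetical. It assumes the text really was decoded as iso-8859-1; if windows-1252 was used instead, that codec name would be needed.

```python
def fix_mojibake(garbled):
    """Undo UTF-8 bytes that were wrongly decoded as iso-8859-1.

    Hypothetical helper: re-encoding as iso-8859-1 recovers the original
    bytes, which are then decoded with the encoding they were really in.
    """
    return bytes(garbled, "iso-8859-1").decode("utf-8")


original = "Ve\u0159ejn\u00e9"  # "Veřejné"

# Reproduce the damage: UTF-8 bytes misread as iso-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

repaired = fix_mojibake(garbled)
```

The round trip is lossless because iso-8859-1 maps every byte value to exactly one character, so the original bytes can always be recovered.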

Decode to Unicode, then encode the result to UTF-8.

apple.decode('latin1').encode('utf8')

concept = concept.encode('ascii', 'ignore')
concept = MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())

I do this. I am not sure whether it is a good approach, but it works every time!
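One caveat is worth checking before adopting this: with the 'ignore' error handler, characters that ASCII cannot represent are dropped rather than converted. The Python 3 sketch below assumes unicode input and uses illustrative variable names.

```python
text = "\u00c4pple"  # "Äpple"

# 'ignore' silently discards every character outside ASCII.
ascii_only = text.encode("ascii", "ignore")

# Encoding straight to UTF-8 keeps the character.
utf8 = text.encode("utf-8")
```

After the ASCII step nothing non-ASCII is left, so the later latin1 and utf8 conversions in the snippet above have nothing left to convert.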
