How do I convert a unicode string containing cp1252 characters to UTF-8 in Python?

Question

I am getting text through an API that returns characters with a Windows-encoded apostrophe (\x92):

> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I'm trying to convert this string to UTF-8 so that it instead prints: "There's thirty days in June"

When I try to decode or encode this unicode string, it throws an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>

If I initialize the string as a plain byte string (str) and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June

My question is: how do I convert the unicode string that I'm getting into a plain byte string so that I can decode it?

Answer 1

It seems your string was decoded with latin1 (as it is of type unicode).

1. To convert it back to the bytes it originally was, encode it using that same encoding (latin1).
2. Then, to get text (unicode) back, decode those bytes using the proper codec (cp1252).
3. Finally, if you want UTF-8 bytes, encode the result using the UTF-8 codec.

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the API takes text (unicode) or bytes, step 3 (encoding to UTF-8) may not be necessary.
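
If you need this in more than one place, the same round trip can be wrapped in a small helper. The following is only a sketch, written for Python 3 (where `str` is the text type and the `u''` prefix is optional); the name `repair_cp1252` is made up for illustration. It assumes the text really was cp1252 bytes mis-decoded as latin1, and it returns the input unchanged when the round trip is not possible, for example when the text contains characters outside latin1 or bytes that cp1252 leaves undefined.

```python
def repair_cp1252(text):
    """Hypothetical helper: undo a latin1 mis-decoding of cp1252 bytes.

    Returns the input unchanged if the round trip fails.
    """
    try:
        # latin1 maps code points 0-255 straight back to the original bytes;
        # cp1252 then interprets those bytes the way Windows intended.
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin1, or a byte cp1252 does not define.
        return text


title = 'There\x92s thirty days in June'
fixed = repair_cp1252(title)       # text with U+2019 in place of \x92
utf8_bytes = fixed.encode('utf-8')  # only needed if the consumer wants bytes
print(utf8_bytes)
```

Falling back to the original text is a design choice for this sketch; depending on your data you may prefer to let the exception propagate so that unexpected input is noticed.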
