
Python: what does "...".encode("utf8") fix?

Original post: 2010-07-20 14:41:40. Tags: python, unicode, internationalization, urlencode, utf-8

I wanted to URL-encode a Python string and got exceptions with Hebrew strings. I couldn't fix it and started doing some guess-oriented programming. Finally, doing mystr = mystr.encode("utf8") before sending it to the URL encoder saved the day.

Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).

6 answers

My original string was a unicode string anyways (i.e. prefixed by a u)

...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is, it is abstracted away, and they are shown as \uXXXX escapes when you print repr(my_u_str).
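To make the "sequence of code points" idea concrete, here is a minimal sketch. It assumes Python 3, where the built-in ascii() plays the role that repr() plays for unicode objects in Python 2; the name word and the sample Hebrew text are illustrative and not from the original post.

```python
# Python 3 sketch; `word` is an illustrative name, not from the original post.
word = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, written with escapes

# Each element of a text string is a code point, i.e. an abstract integer.
code_points = [hex(ord(ch)) for ch in word]
print(code_points)  # ['0x5e9', '0x5dc', '0x5d5', '0x5dd']

# ascii() shows non-ASCII code points as \uXXXX escapes.
escaped = ascii(word)
print(escaped)      # '\u05e9\u05dc\u05d5\u05dd'
```

Nothing here says how the code points are laid out as bytes; that only gets decided when the text is encoded.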

To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF-8 and UTF-16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
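A short sketch of "ASCII could be too, if it fits", written for Python 3 (an assumption; the thread itself is about Python 2, where the same calls apply to unicode objects). The variable names and the Hebrew sample are illustrative.

```python
# Python 3 sketch; names and sample text are illustrative.
text_ascii = "abc"
text_hebrew = "\u05e9\u05dc\u05d5\u05dd"

# ASCII is fine when every code point is below 128.
ascii_bytes = text_ascii.encode("ascii")
print(ascii_bytes)  # b'abc'

# UTF-8 can represent any code point, including Hebrew letters.
utf8_bytes = text_hebrew.encode("utf-8")
print(utf8_bytes)   # b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

# ASCII does not fit Hebrew text, so encoding raises an exception.
try:
    text_hebrew.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```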

Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: the first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway.)

Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
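For reference, a minimal sketch of the same experiment under Python 3, offered as an illustration rather than as part of the answer above. It assumes a Python 3 interpreter, where text literals are str without a u prefix and encoding yields bytes.

```python
# Python 3 version of the experiment above (an assumption; the answer targets 2.x).
my_u_str = "\u2119ython"            # text: a sequence of code points
encoded = my_u_str.encode("utf8")   # binary: a sequence of bytes

print(type(my_u_str))  # <class 'str'>
print(type(encoded))   # <class 'bytes'>

# The first character needs three bytes in UTF-8, so the lengths differ.
print(len(my_u_str), len(encoded))  # 6 8
```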

Your original string was a unicode object containing raw Unicode code points. After encoding it as UTF-8, it is a normal byte string that contains UTF-8 encoded data.

The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
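The "one byte after another" idea can be sketched as follows. This is not how urllib is implemented; percent_encode is a hypothetical helper written for illustration, assuming Python 3, and it is compared against the standard urllib.parse.quote only as a sanity check.

```python
from urllib.parse import quote

# Bytes that may appear in a URL without escaping (letters, digits, "_.-~").
UNRESERVED = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~"

def percent_encode(data: bytes) -> str:
    """Hypothetical helper: escape every byte that is not unreserved."""
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in data
    )

hebrew_bytes = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")
encoded = percent_encode(hebrew_bytes)
print(encoded)                                   # %D7%A9%D7%9C%D7%95%D7%9D
print(encoded == quote(hebrew_bytes, safe=""))   # True
```

The helper never sees code points, only bytes, which is why the text has to be encoded before it gets there.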

What does .encode("utf8") do?

It depends on which version of Python you're using:

In Python 3.x, it converts a str object (a sequence of Unicode code points, stored internally as UTF-16 or UTF-32 in the builds current at the time of writing) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8'), which raises an error if the str contains non-ASCII bytes.

Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
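A small sketch of what that distinction looks like in practice, assuming Python 3; the names text and data are illustrative.

```python
# Python 3 sketch; `text` and `data` are illustrative names.
text = "caf\u00e9"            # str: text
data = text.encode("utf-8")   # bytes: binary data

# Text and binary data cannot be mixed implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                  # False

# Each type only offers the conversion that makes sense for it.
print(hasattr(data, "encode"))   # False: bytes cannot be "encoded" again
print(hasattr(text, "decode"))   # False: str cannot be "decoded"

# Converting back is always explicit.
print(data.decode("utf-8") == text)  # True
```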

Dive into Python 3 has a good explanation of the issue.

Can somebody explain what happened?

It would help if you told us what the error message was.

The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.

In Python 3.x, urllib.parse.quote accepts both str (the equivalent of Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.

"...".encode("utf-8") transforms the in-memory representation of the string into a UTF-8-encoded string.

The URL encoder likely expected a byte string, that is, a string whose elements are single bytes (in UTF-8, one character may take several of them).
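As a rough illustration of a URL encoder working on a byte string, here is a sketch using Python 3's urllib.parse.quote (an assumption; the question was about Python 2's urllib.quote, which behaves differently for unicode input). The sample word is illustrative.

```python
from urllib.parse import quote

word = "\u05e9\u05dc\u05d5\u05dd"   # illustrative Hebrew word
as_bytes = word.encode("utf-8")     # the byte string the encoder works on

# Every non-ASCII byte becomes one %XX escape.
from_bytes = quote(as_bytes)
print(from_bytes)                   # %D7%A9%D7%9C%D7%95%D7%9D

# In Python 3, quote() encodes str input as UTF-8 itself, so the result matches.
from_text = quote(word)
print(from_text == from_bytes)      # True
```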

It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply one way of encoding Unicode. Python can work with many other encodings (e.g. mystr.encode("utf32") or even mystr.encode("ascii")).
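A quick sketch of the same text under several encodings, assuming Python 3 and an illustrative Hebrew value for mystr. The byte counts for UTF-16 and UTF-32 include the byte order mark that Python's generic "utf-16" and "utf-32" codecs prepend.

```python
# Python 3 sketch; the value of `mystr` is illustrative.
mystr = "\u05e9\u05dc\u05d5\u05dd"

# Same four code points, different byte sequences per encoding.
sizes = {name: len(mystr.encode(name)) for name in ("utf-8", "utf-16", "utf-32")}
print(sizes)  # {'utf-8': 8, 'utf-16': 10, 'utf-32': 20}

# Decoding with the matching encoding restores the original text.
round_trips = all(
    mystr.encode(name).decode(name) == mystr
    for name in ("utf-8", "utf-16", "utf-32")
)
print(round_trips)  # True
```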

The link that balpha posted explains it all. In short:

The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.