简体   繁体   中英

Python: what does “…”.encode(“utf8”) fix?

I wanted to url encode a python string and got exceptions with hebrew strings. I couldn't fix it and started doing some guess oriented programming. Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.

Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (ie prefixed by au).

My original string was a unicode string anyways (ie prefixed by au)

...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \\uXXXX entities when you print repr(my_u_str) .

To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.

Do my_u_str = u"\ℙython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'> . (Under Python 2.5 and 2.6, anyway).

Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.

You original string was a unicode object containing raw Unicode code points, after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.

The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.

What does .encode("utf8") do?

It depends on which version of Python you're using:

  • In Python 3.x, it converts a str object (encoded in UTF-16 or UTF-32) into a bytes object containing the UTF-8 representation of the string.
  • In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8') .

Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.

Dive into Python 3 has a good explanation of the issue.

Can somebody explain what happened?

It would help if you told us what the error message was.

The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.

In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode ) and bytes objects. Strings are automatically encoded in UTF-8.

"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.

url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.

It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply 1 way of encoding Unicode. Python can work with many other encodings (eg. mystr.encode("utf32") or even mystr.encode("ascii")).

The link that balpha posted explains it all. In short:

The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM