
Python encoding problem (unicode)

Before throwing tomatoes, let me explain my problem (I have read the python unicode doc first).

I use the json module to parse a JSON-formatted result into dictionaries. This gives me Unicode strings (e.g. u"My string t\xf4t"). Then I use MySQLdb to store this string in my MySQL database. Note that this database is configured for utf8.

Then I retrieve my MySQL record, still using MySQLdb. Now the printed result looks like "My string t\xf4t" (without the u). As I need to compare the inserted and the retrieved strings, I have to tell Python that the retrieved string is Unicode.

No matter what I try, I get a UnicodeDecodeError. I tried to play with the encoding, unicode(storedInDB, "utf_8"), and with the errors parameter ("replace"), but I still get exceptions.

Do you have hints?

Thanks for your help!

u"My string t\xf4t" is a Unicode string (its type is unicode), but "My string t\xf4t" is a bytestring (its type is str).

unicode(storedInDB, "utf_8") tries to decode the bytestring as UTF-8, but "My string t\xf4t" isn't valid UTF-8.

It appears that although you configured MySQL for UTF-8, you didn't actually write UTF-8 data into it. You would have had to encode from Unicode to UTF-8 before sending the string.
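To make the byte-level difference concrete, here is a minimal sketch. It is written in Python 3 syntax (an assumption; the question uses Python 2), where bytes plays the role of Python 2's str and str plays the role of unicode. The helper name try_decode_utf8 is made up for illustration.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = "My string t\xf4t"  # the Unicode text; \xf4 is U+00F4 (o with circumflex)

# The same text as bytes in two encodings.
latin1_bytes = text.encode("latin-1")  # a single byte 0xF4 for U+00F4
utf8_bytes = text.encode("utf-8")      # two bytes 0xC3 0xB4 for U+00F4


def try_decode_utf8(data):
    """Hypothetical helper: return decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 0xF4 followed by an ASCII 't' is not a valid UTF-8 sequence, so this fails.
print(try_decode_utf8(latin1_bytes))
# Properly encoded UTF-8 round-trips back to the original text.
print(try_decode_utf8(utf8_bytes) == text)
# The stored bytes do decode if read as latin-1, which hints at how they were written.
print(latin1_bytes.decode("latin-1") == text)
```

If the bytes coming back from the database look like the latin-1 form above, decoding them as UTF-8 will keep failing no matter which options are passed; the data has to be written as UTF-8 in the first place.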

Most likely, what you want to do is add charset='utf8' to your MySQLdb.connect() call.

For MySQL itself, character sets are set separately in many different contexts - most notably, for both table storage and for connections (and MySQL unfortunately still seems to default to latin-1 in many cases). So, you can - for example - go to the trouble of setting your entire database to use UTF-8:

CREATE DATABASE somedatabase DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

And yet, when you connect a client, MySQL may still think you're communicating with it in some other encoding:

mysql> show variables like 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

A basic solution to this is to execute SET NAMES UTF8 immediately upon connecting, before you do anything else:

mysql> SET NAMES UTF8;
mysql> show variables like 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

However, in your case, this still isn't sufficient, because the Python MySQLdb module itself also wants to be helpful and automatically encode/decode Python's native unicode strings for you. So, you have to set the character set in MySQLdb. This is best done, as mentioned earlier, by passing charset='utf8' when creating your MySQLdb connection. (This will also cause MySQLdb to inform the MySQL server that your connection is using UTF8, so you do not need to run SET NAMES UTF8 directly.)
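A minimal sketch of what that connection setup might look like. The helper names (utf8_connect_kwargs, open_utf8_connection) and the credential values are hypothetical; only the charset and use_unicode arguments to MySQLdb.connect() are the point here.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Hypothetical helper: build keyword arguments for MySQLdb.connect()."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        "charset": "utf8",    # the connection character set suggested above
        "use_unicode": True,  # return text columns as Unicode strings
    }


def open_utf8_connection(host, user, passwd, db):
    """Open a MySQLdb connection that talks UTF-8 to the server."""
    import MySQLdb  # imported lazily; requires the MySQLdb driver to be installed
    return MySQLdb.connect(**utf8_connect_kwargs(host, user, passwd, db))


# Example with placeholder credentials (not executed here):
# conn = open_utf8_connection("localhost", "someuser", "secret", "somedatabase")
```

With a connection opened this way, Unicode strings passed as query parameters should be encoded by the driver on the way in and decoded on the way out, so the inserted and retrieved values can be compared directly as Unicode.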
