UnicodeDecodeError on sqlalchemy connection.execute() for select queries

I'm using SQLAlchemy Core to execute string-based queries. I have set the charset to utf8mb4 in the connection string like this:

"mysql+mysqldb://{user}:{password}@{host}:{port}/{db}?charset=utf8mb4"

For some simple select queries (e.g., select name from users where id=XXX limit 1), when the result set contains non-ASCII characters (e.g., ', ì), the call fails with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 11: invalid start byte

However, I cannot reproduce the error reliably. When I run the same query from a Python shell, it works without errors, but it fails in a web request or background job.

I'm using Python 3.8 and sqlalchemy 1.3.24.

I have also tried explicitly specifying charset: utf8mb4 as a connect_args property with create_engine() .

The underlying database is MySQL 5.7, and all the unicode columns have utf8mb4 explicitly set as the character set in the schema. Update: The database is actually AWS RDS Aurora MySQL.

Appreciate any insights on the error or how to reproduce it reliably.

Can you try adding the use_unicode=true parameter to the URL?

The MySQL documentation on Connect-Time Error Handling describes a problem that occurs when the MySQL 8.0 client library connects to a MySQL 5.7 server with the utf8mb4 charset. The 8.0 client asks for the utf8mb4_0900_ai_ci collation, which the 5.7 server does not recognize, so the server silently falls back to its default character set and collation (latin1 and latin1_swedish_ci unless configured otherwise). From then on the server sends latin1 result sets while the client believes it is receiving utf8mb4, which eventually results in a UnicodeDecodeError. As a workaround you have to explicitly run SET NAMES utf8mb4. I created an issue, mysqlclient#504, to ask that the Python client do that every time.
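The decoding failure can be illustrated without a database. The following is only a sketch under assumptions: the name is a hypothetical value, and Python's cp1252 codec stands in for MySQL's latin1 (which is based on cp1252). It encodes a string the way a latin1 connection would send it and then decodes it the way a client expecting utf8mb4 would.

```python
# Hypothetical column value; any character outside ASCII would do.
name = "Miloš"

# Bytes as a connection stuck on latin1 would send them.
# Assumption: cp1252 is used here as a stand-in for MySQL's latin1.
wire_bytes = name.encode("cp1252")

decode_error = None
try:
    # The client believes the result set is utf8mb4 and decodes accordingly.
    wire_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```

In this sketch the offending byte is 0x9a, the same byte reported in the question, because that is how cp1252 encodes "š"; it is not a valid first byte of a UTF-8 sequence.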

To confirm that the charset is incorrect after connecting, double check the server's value of character_set_client (the charset that statements are interpreted in), character_set_connection (the charset that statements are converted to), and character_set_results (the charset that result sets are sent as). If they are latin1 despite you trying to connect using utf8mb4, then this bug may have been triggered.

# con is an open MySQLdb connection
with con.cursor() as c:
    c.execute("show variables like 'character_set_%'")
    for row in c:
        print(row)

Example output:

(b'character_set_client', b'latin1')
(b'character_set_connection', b'latin1')
(b'character_set_database', b'latin1')
(b'character_set_filesystem', b'binary')
(b'character_set_results', b'latin1')
(b'character_set_server', b'latin1')
(b'character_set_system', b'utf8')
(b'character_sets_dir', b'/usr/share/mysql/charsets/')
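Because the error only shows up in some environments, it may help to turn this check into something you can log or assert on from the web request or background job. The helper below is a hypothetical sketch, not part of MySQLdb or SQLAlchemy: it takes the rows returned by the query above and reports the three session variables that differ from the charset you asked for.

```python
def mismatched_charsets(rows, expected="utf8mb4"):
    """Return the session charset variables whose value differs from `expected`.

    `rows` is an iterable of (name, value) pairs as returned by
    SHOW VARIABLES LIKE 'character_set_%'.
    """
    session_vars = {
        "character_set_client",
        "character_set_connection",
        "character_set_results",
    }
    mismatches = {}
    for name, value in rows:
        # Rows may arrive as bytes or str depending on the driver settings.
        if isinstance(name, bytes):
            name = name.decode("ascii")
        if isinstance(value, bytes):
            value = value.decode("ascii")
        if name in session_vars and value != expected:
            mismatches[name] = value
    return mismatches


# Sample rows copied from the output shown above.
sample_rows = [
    (b"character_set_client", b"latin1"),
    (b"character_set_connection", b"latin1"),
    (b"character_set_results", b"latin1"),
    (b"character_set_system", b"utf8"),
]
charset_problems = mismatched_charsets(sample_rows)
print(charset_problems)
```

An empty result would mean the session variables match what was requested; a non-empty one suggests the fallback described above may have happened on that connection.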

I believe that a workaround for the issue would be to do the following after connecting:

# explicitly set connection charset to the same as MySQLdb.connect()
con.query("SET NAMES utf8mb4")
con.store_result()
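Since the question goes through SQLAlchemy rather than a raw MySQLdb connection, one way to apply the same statement to every pooled connection might be a "connect" event listener. This is a sketch under assumptions, not a confirmed fix: the function name force_utf8mb4 is made up for illustration, and the statement is issued through a DBAPI cursor rather than con.query().

```python
def force_utf8mb4(dbapi_connection, connection_record):
    """Run SET NAMES utf8mb4 on a newly created DBAPI connection.

    The two-argument signature follows SQLAlchemy's "connect" pool event.
    """
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()


# Registration, assuming `engine` is the Engine returned by create_engine():
#   from sqlalchemy import event
#   event.listen(engine, "connect", force_utf8mb4)
```

The listener runs once per new DBAPI connection, so connections handed out by the pool later would already have the session charset set. mysqlclient's init_command connect argument may be another place to issue the same statement.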
