I'm using SQLAlchemy Core to execute string-based queries. I have set the charset to utf8mb4 in the connection string like this:
"mysql+mysqldb://{user}:{password}@{host}:{port}/{db}?charset=utf8mb4"
For some simple select queries (e.g. select name from users where id=XXX limit 1), when the result set contains some Unicode characters (e.g. ', ì, etc.), it fails with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 11: invalid start byte
However, the error is not reliably reproducible. When I run the same query from a Python shell, it works without errors, but it fails during a web request or a background job.
I'm using Python 3.8 and SQLAlchemy 1.3.24.
I have also tried explicitly specifying charset: utf8mb4 as a connect_args property with create_engine().
The underlying database is MySQL 5.7, and all the Unicode columns have utf8mb4 explicitly set as the character set in the schema. Update: The database is actually AWS RDS Aurora MySQL.
Appreciate any insights on the error or how to reproduce it reliably.
Can you try with the use_unicode=true parameter in the URL?
The MySQL documentation section "Connect-Time Error Handling" describes a bug that occurs when the MySQL 8.0 client library connects to a MySQL 5.7 server with the utf8mb4 charset. The MySQL 8.0 client asks for the utf8mb4_0900_ai_ci collation, but the MySQL 5.7 server does not recognize that collation, so the server silently falls back to the latin1 charset with the latin1_swedish_ci collation. From then on the server sends latin1 result sets while the client believes it is receiving utf8mb4, which eventually results in a UnicodeDecodeError. As a workaround, you have to explicitly run SET NAMES utf8mb4. I created an issue, mysqlclient#504, to ask that the Python client do that every time.
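The mismatch described above can be reproduced without a database. The following is a minimal sketch, assuming that MySQL's latin1 corresponds to Python's cp1252 codec (as far as I know it is a close match) and using a made-up name as sample data. It encodes a string the way a latin1 connection would send it and then decodes it the way a client expecting utf8mb4 would:

```python
# Hypothetical sample value; any text containing "š" behaves the same way.
name = "Milan Miloš"

# Assumption: MySQL's latin1 is treated here as Python's cp1252 codec.
# In cp1252 the character "š" is the single byte 0x9a.
wire_bytes = name.encode("cp1252")

try:
    # The client believes the connection charset is utf8mb4 and decodes as UTF-8.
    wire_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0x9a is a UTF-8 continuation byte, so it cannot start a character.
    error_message = str(exc)

print(error_message)
```

This would also explain why only some rows fail: pure ASCII values are byte-for-byte identical in latin1 and UTF-8, so the error only shows up when a result contains a non-ASCII character.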
To confirm that the charset is incorrect after connecting, double check the server's value of character_set_client (the charset that statements are interpreted in), character_set_connection (the charset that statements are converted to), and character_set_results (the charset that result sets are sent as). If they are latin1 despite you trying to connect using utf8mb4, then this bug may have been triggered.
with con.cursor() as c:
    c.execute("show variables like 'character_set_%'")
    for row in c:
        print(row)
(b'character_set_client', b'latin1')
(b'character_set_connection', b'latin1')
(b'character_set_database', b'latin1')
(b'character_set_filesystem', b'binary')
(b'character_set_results', b'latin1')
(b'character_set_server', b'latin1')
(b'character_set_system', b'utf8')
(b'character_sets_dir', b'/usr/share/mysql/charsets/')
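Since the error is intermittent, it may help to check these variables programmatically, for example in a health check or at the start of a background job. The sketch below is only an illustration: find_charset_mismatches is a hypothetical helper, not part of mysqlclient or SQLAlchemy, and it works on rows shaped like the output above.

```python
# The three session variables that describe the connection charset.
SESSION_VARS = (
    "character_set_client",
    "character_set_connection",
    "character_set_results",
)


def _as_text(value):
    """Rows may come back as bytes or str depending on the driver settings."""
    return value.decode("ascii") if isinstance(value, bytes) else value


def find_charset_mismatches(rows, expected="utf8mb4"):
    """Return {variable: actual_value} for session variables that differ from expected."""
    mismatches = {}
    for raw_name, raw_value in rows:
        name, value = _as_text(raw_name), _as_text(raw_value)
        if name in SESSION_VARS and value != expected:
            mismatches[name] = value
    return mismatches


# Rows copied from the output shown above.
rows = [
    (b"character_set_client", b"latin1"),
    (b"character_set_connection", b"latin1"),
    (b"character_set_database", b"latin1"),
    (b"character_set_filesystem", b"binary"),
    (b"character_set_results", b"latin1"),
    (b"character_set_server", b"latin1"),
    (b"character_set_system", b"utf8"),
    (b"character_sets_dir", b"/usr/share/mysql/charsets/"),
]

mismatches = find_charset_mismatches(rows)
print(mismatches)
```

A non-empty result right after connecting suggests that the connection fell back to latin1. Logging it from the web request or background job could show whether only some processes are affected.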
I believe that a workaround for the issue would be to do the following after connecting:
# explicitly set connection charset to the same as MySQLdb.connect()
con.query("SET NAMES utf8mb4")
con.store_result()
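Because the question uses SQLAlchemy rather than a raw MySQLdb connection, the same workaround could be applied to every pooled connection with a "connect" event listener. This is a sketch, not a tested fix for this exact setup: set_names_utf8mb4 and make_engine are hypothetical names, and the URL is whatever connection string you already use.

```python
def set_names_utf8mb4(dbapi_connection, connection_record=None):
    """Force the session charset on a newly opened DBAPI connection."""
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()


def make_engine(url):
    """Create an engine that runs SET NAMES utf8mb4 on each new connection."""
    # Imported here so the listener above can be exercised without SQLAlchemy.
    from sqlalchemy import create_engine, event

    engine = create_engine(url)
    # The "connect" event fires once for each new DBAPI connection in the pool.
    event.listen(engine, "connect", set_names_utf8mb4)
    return engine
```

The listener runs once per physical connection, not per query, so the overhead should be small. Another option that may work is passing init_command through connect_args, which mysqlclient accepts, but I have not verified it against this bug.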