简体   繁体   中英

UnicodeDecodeError Loading with sqlalchemy

I am querying a MySQL database with sqlalchemy and getting the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 498-499: unexpected end of data

A column in the table was defined as Unicode(500) so this error suggests to me that there is an entry that was truncated because it was longer than 500 characters. Is there a way to handle this error and still load the entry? Is there a way to find the errant entry and delete it other than trying to load every entry one by one (or in batches) until I get the error?

In short, you should change:

Unicode(500)

to:

Unicode(500, unicode_errors='ignore', convert_unicode='force')

(Python 2 code follows, but the principles hold in python 3; only some of the output will differ.)

What's going on is that when you decode a bytestring, it complains if the bytestring can't be decoded, with the error you saw.

>>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
>>> u
u'ABCDEFGH\u2122'
>>> print(u)
ABCDEFGH™
>>> s = u.encode('utf-8')
>>> s
'ABCDEFGH\xe2\x84\xa2'
>>> truncated = s[:-1]
>>> truncated
'ABCDEFGH\xe2\x84'        
>>> truncated.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py", 
line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected 
end of data

Python provides different optional modes of handling decode errors, though. Raising an exception is the default, but you can also truncate the text or convert the malformed part of the string to the official unicode replacement character.

>>> trunc.decode('utf-8', errors='replace')
u'ABCDEFGH\ufffd'
>>> trunc.decode('utf-8', errors='ignore')
u'ABCDEFGH'

This is exactly what's happening within the column handling.

Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py , it looks like there is a unicode_errors argument that you can pass to the constructor which passes its value through to the encoder's errors argument. There is also a note that you will need to set convert_unicode='force' to make it work.

Thus Unicode(500, unicode_errors='ignore', convert_unicode='force') should solve your problem, if you're okay with truncating the ends of your data.

If you have some control over the database, you should be able to prevent this issue in the future by defining your database to use the utf8mb4 character set. (Don't just use utf8 , or it will fail on four byte utf8 characters, including most emojis). Then you will be guaranteed to have valid utf-8 stored in and returned from your database.

Make the column you are storing into a BLOB . After loading the data, do various things such as

 SELECT MAX(LENGTH(col)) FROM ... -- to see what the longest is in _bytes_.

Copy the data into another BLOB column and do

 ALTER TABLE t MODIFY col2 TEXT CHARACTER SET utf8 ... -- to see if it converts correctly

If that succeeds, then do

 SELECT MAX(CHAR_LENGTH(col2)) ... -- to see if the longest is more than 500 _characters_.

After you have tried a few things like that, we can see what direction to take next.

In short, your MySQL setup is incorrect in that it truncates UTF-8 characters in mid-sequence. I would check twice that MySQL actually expects the character encoding of UTF-8 within the sessions and in the tables themselves.


I would suggest switching to PostgreSQL (seriously) to avoid this kind of problem: not only does PostgreSQL understand UTF-8 properly in default configurations, but also it would not ever truncate a string to fit into the value, choosing to raise an error instead:

psql (9.5.3, server 9.5.3)
Type "help" for help.

testdb=> create table foo(bar varchar(4));
CREATE TABLE
testdb=> insert into foo values ('aaaaa');
ERROR:  value too long for type character varying(4)

This is also not unlike the Zen of Python:

Explicit is better than implicit.

and

Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM