
How to fix python hardcoded dictionary encoding issue

Error:

pymysql.err.InternalError: (1366, "Incorrect string value: '\\xEF\\xBF\\xBD 20...' for column 'history' at row 1")

I've received a few variations of this error as I've tweaked my dictionary, always on the history column; the only variation is in which characters it flags as the problem.

I can't post the dictionary because it contains sensitive information, but here is the gist:

  • I started with 200 addresses (including state, zip, etc.) that needed to be validated, normalized, and standardized for DB insertion.
  • I spent a lot of time on Google Maps validating and standardizing.
  • I decided to get fancy and kept all the crazy accented letters in these worldwide addresses (often copied from Google, because I don't know how to type an A with an o over it, lol), Singapore to Brazil, everywhere.
  • I ended up with 120 unique addresses in my dictionary after processing.
  • Everything works 100% perfectly when INSERTING the data into SQLite and OUTPUTTING to a CSV. The issue is exclusively with MySQL and some sneaky, un-viewable characters.

Note: I used this to remove the accents, after 7 hours of copy/pasting into Notepad, re-encoding with Notepad++, and just trying to process the data in a way that made it all the correct encoding. I think I lost the version with the accents and only have this tool's output now.

I do not see "\\xEF\\xBF\\xBD 20..." in my dictionary; I only see text. Currently I don't even see "20"... those two characters helped me find the previous issues.
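One way to hunt for the invisible characters is to scan every string value in the dictionary for non-ASCII code points and print their Unicode names. A minimal sketch, with hypothetical sample data standing in for the real (sensitive) dictionary:

```python
import unicodedata

# Hypothetical sample data -- stand-in for the real assets_final dict.
assets_final = {
    "host1": {"history": "M\u0141A II 2019-01-01"},  # hides a U+0141
    "host2": {"history": "plain ascii value"},
}

def find_non_ascii(d):
    """Report every non-ASCII character hiding in the dict's string values."""
    hits = []
    for key, row in d.items():
        for field, value in row.items():
            for i, ch in enumerate(value):
                if ord(ch) > 127:
                    hits.append((key, field, i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

for hit in find_non_ascii(assets_final):
    print(hit)
```

Running this over the real dictionary should name every character that plain text editors render invisibly (or identically to ASCII).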

Code I can show:

def insert_tables(cursor, assets_final, ips_final):
    #Insert Asset data into asset table
    field_names_dict = get_asset_field_names(assets_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for key, row in assets_final.items():
        insert_sql = 'INSERT INTO asset(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(row.values())) + '")'
        print(insert_sql)
        cursor.execute(insert_sql)

    #Insert IP data into IP table
    field_names_dict = get_ip_field_names(ips_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for hostname_key, ip_dict in ips_final.items():
        for ip_key, ip_row in ip_dict.items():
            insert_sql = 'INSERT INTO ip(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(ip_row.values())) + '")'
            print(insert_sql)
            cursor.execute(insert_sql)

def output_sqlite_db(sqlite_file, assets_final, ips_final):
    conn = sqlite3.connect(sqlite_file)
    cursor = conn.cursor()
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

def output_mysql_db(assets_final, ips_final):
    conn = mysql.connect(host=config.mysql_ip, port=config.mysql_port, user=config.mysql_user, password=config.mysql_password, charset="utf8mb4", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('USE ' + config.mysql_DB)
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()
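As an aside on the code above: building the INSERT by string concatenation forces the manual `"` to `'` replacement and leaves quoting/encoding to chance. A sketch of the parameterized alternative, where the driver handles quoting and encoding (pymysql uses `%s` placeholders, sqlite3 uses `?`; the table and column names here are illustrative):

```python
import sqlite3

def build_insert_sql(table, column_names, placeholder="%s"):
    """Parameterized INSERT: pymysql expects %s placeholders, sqlite3 expects ?."""
    columns = ",".join(column_names)
    placeholders = ",".join([placeholder] * len(column_names))
    return "INSERT INTO " + table + "(" + columns + ") VALUES (" + placeholders + ")"

# Demo against an in-memory SQLite DB (so sqlite3's "?" placeholder is used).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE asset (hostname TEXT, history TEXT)")

row = {"hostname": "host1", "history": 'ŁA II "quoted" value'}  # accents + quotes
sql = build_insert_sql("asset", row.keys(), placeholder="?")
cursor.execute(sql, tuple(row.values()))  # values travel separately from the SQL
conn.commit()
```

With placeholders, embedded double quotes and accented characters need no `.replace('"', "'")` fix-ups, and the same loop structure from insert_tables() carries over unchanged.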

EDIT: Could this have something to do with the fact I'm using Cygwin as my terminal? HA! I added this line and got a different message (now using the accented version again):

cursor.execute('SET NAMES utf8')

Error:

pymysql.err.InternalError: (1366, "Incorrect string value: '\\xC5\\x81A II...' for column 'history' at row 1")

I can shine a bit of light on the messages that you have supplied:

Case 1:

>>> import unicodedata as ucd
>>> s1 = b"\xEF\xBF\xBD"
>>> s1
b'\xef\xbf\xbd'
>>> u1 = s1.decode('utf8')
>>> u1
'\ufffd'
>>> ucd.name(u1)
'REPLACEMENT CHARACTER'
>>>

It looks like you obtained some bytes encoded in an encoding other than UTF-8 (e.g. cp1252), then tried bytes.decode(encoding='utf8', errors='strict'), which detected some errors. You then decoded again with errors="replace". This raised no exceptions; however, your data has had the error bytes replaced by the replacement character (U+FFFD). Then you encoded your data using str.encode so that you could write it to a file or database. Each replacement character turns up as the 3 hex bytes EF BF BD.
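The whole round trip described above can be reproduced in a few lines (using cp1252 as an assumed example of the original encoding):

```python
# Bytes produced in cp1252 are not valid UTF-8...
raw = "café".encode("cp1252")               # b'caf\xe9'

# ...so a strict decode fails, and errors="replace" swaps in U+FFFD:
fixed = raw.decode("utf8", errors="replace")
print(fixed)                                 # 'caf\ufffd'

# Re-encoding as UTF-8 turns each U+FFFD into the 3 bytes EF BF BD --
# exactly the sequence MySQL complained about:
reencoded = fixed.encode("utf8")
print(reencoded)                             # b'caf\xef\xbf\xbd'
```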

... more to come

Case 2:

>>> s2 = b"\xC5\x81A II"
>>> s2
b'\xc5\x81A II'
>>> u2 = s2.decode('utf8')
>>> u2
'\u0141A II'
>>> ucd.name(u2[0])
'LATIN CAPITAL LETTER L WITH STROKE'
>>>
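Note what this second transcript implies: \xC5\x81 is perfectly valid UTF-8 for 'Ł' (U+0141), so the client-side data is fine this time. MySQL can still reject it if the connection or the target column uses a charset that cannot represent the character; a sketch demonstrating the mismatch with latin1 as an assumed example:

```python
# The bytes from the second error message decode cleanly as UTF-8:
ch = b"\xc5\x81".decode("utf8")
print(ch)                    # 'Ł' (U+0141)

# But a latin1 connection/column has no way to store it:
try:
    ch.encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin1 cannot represent it:", exc)
```

So the fix belongs on the MySQL side: make sure the connection, the database, and the history column itself are all utf8mb4, not just the client charset.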

