Encoding and decoding do not treat all Polish letters the same

From another source I get names containing two Polish letters (ń and ó), like these:

  • piaseczyÅ„ski
  • zielonogórski

Of course, there are more than two such names.

The first should look like piaseczyński, and the second already looks right. But when I try to fix them with str(entity_name).encode('1252').decode('utf-8'), the first is fixed, but the second raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte

Why are the Polish letters not treated the same? How can I fix it?

As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
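To see why the two names behave differently, here is a small sketch that reproduces both cases. It assumes the first name was UTF-8 data that got decoded as code page 1252 somewhere upstream, which matches the symptoms in the question but isn't confirmed by it.

```python
# Assumption: the first name was UTF-8 bytes wrongly decoded as cp1252.
good = "piaseczyński"
mangled = good.encode("utf-8").decode("cp1252")
print(mangled)  # piaseczyÅ„ski

# Reversing those two steps repairs the mangled name.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired)  # piaseczyński

# The second name is already correct. Encoding it as cp1252 turns "ó" into
# the single byte 0xf3, which is not a complete UTF-8 sequence on its own.
raw = "zielonogórski".encode("cp1252")
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
print(error)
```

So the "fix" only works on names that were actually mangled; applied to a correct name, it creates bytes that aren't valid UTF-8.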

If you really can't do that, you should try to decode as UTF-8 first, because it's stricter: not every string of bytes is valid UTF-8. If you get a UnicodeDecodeError, try to decode it as some other encoding:

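def decode_crappy_bytes(b):
    # Try UTF-8 first: it is strict, so bytes in another encoding usually fail.
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to Windows code page 1252.
        return b.decode('1252')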

Note that this can still fail, in two ways:

  1. If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
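  2. If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Almost every bytestring that's valid in one is also valid in the other (Python's cp1252 codec rejects only a handful of unassigned byte values), so a decoding error won't tell them apart.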
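Both failure modes are silent: no exception is raised, you just get the wrong text. A short illustration, using made-up byte strings rather than anything from the question:

```python
# Failure mode 1: bytes meant as cp1252 text "Ã©" also happen to be valid
# UTF-8, so a UTF-8-first strategy quietly returns "é" instead.
raw = "Ã©".encode("cp1252")        # b'\xc3\xa9'
as_utf8 = raw.decode("utf-8")      # 'é'
as_cp1252 = raw.decode("cp1252")   # 'Ã©'

# Failure mode 2: the same byte decodes without error in both cp1252 and
# Latin-1, but to different characters.
euro_cp1252 = b"\x80".decode("cp1252")    # '€'
ctrl_latin1 = b"\x80".decode("latin-1")   # '\x80', a control character
```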

If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
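A minimal sketch of that scoring idea follows. The names polish_score and guess_decode are hypothetical, and the list of candidate encodings is an assumption; adjust it to whatever your source might plausibly emit.

```python
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

# Assumption: these are the encodings the source might use. UTF-8 comes
# first so that it wins ties, since it is the strictest of the four.
CANDIDATE_ENCODINGS = ("utf-8", "cp1250", "iso-8859-2", "cp1252")


def polish_score(text):
    """Count the non-ASCII Polish letters in text."""
    return sum(ch in POLISH_LETTERS for ch in text)


def guess_decode(raw, encodings=CANDIDATE_ENCODINGS):
    """Decode raw bytes with the candidate that yields the most Polish letters."""
    best_text, best_score = None, -1
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        score = polish_score(text)
        if score > best_score:  # strict ">" keeps the earliest candidate on ties
            best_text, best_score = text, score
    if best_text is None:
        raise ValueError("none of the candidate encodings could decode the input")
    return best_text


print(guess_decode("piaseczyński".encode("utf-8")))
print(guess_decode("śląski".encode("cp1250")))
```

Text with no Polish letters at all scores zero in every encoding, so in that case the result just depends on the order of the candidates.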

@Thomas I added another except clause, and now it works perfectly:

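try:
    entity_name = entity_name.encode('1252').decode('utf-8')
except (UnicodeDecodeError, UnicodeEncodeError):
    # The name was apparently not mangled: either it is already correct, or it
    # contains characters outside code page 1252 (such as ż). Leave it as is.
    pass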

It also passes for żarski.
