
Python problems encoding and decoding in UTF-8

I am using Python 3. I read a file into memory as bytes, assign it to a variable, and then convert the binary data to a string with:

def to_str(bytes_or_str):
  if isinstance(bytes_or_str, bytes):
    value = bytes_or_str.decode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

I do this because I want to replace some of the characters in the file, using a list I made that contains the first 256 chr() values.

Once the loaded file variable is edited, I then rewrite the file as bytes with:

def to_bytes(bytes_or_str):
  if isinstance(bytes_or_str, str):
    value = bytes_or_str.encode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

It works great as long as I only use ASCII characters. I can use latin-1 instead of utf-8 and it works for code points up to 255, but beyond that the encoding and decoding functions break. Latin-1 is a single-byte encoding covering 256 values, which I am guessing is why it works up to that point but not beyond it. I would like to use utf-8 because it covers a much broader range of characters, but it fails with my two encode/decode functions above, and data gets lost if I use characters that aren't ASCII. Is this problem caused by the fact that utf-8 uses more than one byte per character from chr(128) upward, or by something else? Do I need to use something like the pack() method to isolate characters that use more than one byte? With this function I can find how many bytes a character takes in UTF-8:

def utf8len(x):
    return len(x.encode('utf-8'))

If the data loss is caused by characters taking more than one byte, maybe I can use this somehow? Does anyone have any other ideas? Thanks for any help.

Also: let's say I convert the character 'Ω' to bytes, which the Python console shows as b'\xe2\x84\xa6'. How exactly does this work if each character in bytes is a set of more characters? When I convert a character to bytes, Python displays it as characters and not as 0's and 1's. Aren't bytes 0's and 1's? I don't know what Python is doing here.
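A small sketch of what the console is showing, assuming the character is U+2126 (OHM SIGN), which is the code point that encodes to those three bytes. A bytes object is a sequence of integers from 0 to 255; its repr prints printable ASCII values as characters and everything else as \xNN hex escapes, so the display is only a notation for the underlying bits. The variable names are illustrative.

```python
# Assumption: the character is U+2126 (OHM SIGN); names are illustrative.
encoded = '\u2126'.encode('utf-8')

print(encoded)        # repr uses \xNN hex escapes for non-ASCII byte values
print(len(encoded))   # number of bytes, not number of characters
print(list(encoded))  # each element of a bytes object is an int in 0..255

# The same bytes written out as 8-bit binary strings
bits = [format(byte, '08b') for byte in encoded]
print(bits)
```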

I made this code to try to explain how it works but I still don't completely understand:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

def bits2string(b=None):
    return ''.join([chr(int(x, 2)) for x in b])

def utf8len(x):
    return len(x.encode('utf-8'))

def latin1len(x):
    return len(x.encode('latin-1'))

char_num = 255
def_char = chr(char_num)

char = def_char
bit = string2bits(char)
char2 = bits2string(bit)

print('\nString:')
print(char2)

print('\nUTF-8 byte Len:')
print(utf8len(char))

# Guarded because Latin-1 cannot encode code points above 255:
#  'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
if char_num < 256:
    print('\nLatin-1 byte Len:')
    print(latin1len(char))

print('\nList of Bits:')
for x in bit:
    print(x)

In the # -*- coding -*- comment at the top of the script, I can switch the script encoding between utf-8 and latin-1, and I can change the char_num variable to see the string of bits for that character in each encoding. But if char_num is above 255 with latin-1, I get the error: UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)

If I hard code the encoding from utf-8 to latin-1 with:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

Shouldn't this code display the bits of the def_char for latin-1 encoding? How does Python work here?
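One possible explanation, sketched below: the coding line only tells Python how to decode the source file itself, while ord() returns a Unicode code point, which is the same number whatever encoding the file is saved in. To see encoding-specific bits, the string has to be encoded first. The helper name encoded_bits is hypothetical and not part of the original script.

```python
def encoded_bits(s, encoding):
    """Return the bits of s after encoding it with the given codec."""
    return [format(byte, '08b') for byte in s.encode(encoding)]

char = chr(255)  # code point U+00FF

print(bin(ord(char)))                 # the code point, independent of any encoding
print(encoded_bits(char, 'latin-1'))  # one byte in Latin-1
print(encoded_bits(char, 'utf-8'))    # two bytes in UTF-8
```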

I think the problem is that a JPEG header stores values that can take any byte value (for example pixel density, marker lengths, and so on).


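In Latin-1 every character is one byte, and Python's latin-1 codec maps each byte value 0-255 to the code point with the same number (the ranges 0-31 and 127-159 are control codes rather than printable characters), so any byte sequence decodes and re-encodes unchanged.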


However, UTF-8 is a multibyte encoding. For code points above 127, the first byte has to start with the bits 110 (for two-byte characters), 1110 (for three-byte characters), or 11110 (for four-byte characters). The second, third, and fourth bytes each have to start with 10.
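These bit patterns can be checked with a short sketch. The function name utf8_sequence_length is hypothetical; it only inspects the leading bits of a single byte and does not validate whole sequences (for instance, it does not reject lead bytes such as 0xC0 that never occur in valid UTF-8).

```python
def utf8_sequence_length(first_byte):
    """Return the sequence length implied by a lead byte.

    Returns 0 for a continuation byte (10xxxxxx) and None if the
    leading bits match no UTF-8 pattern.
    """
    if first_byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: single-byte (ASCII)
    if first_byte >> 6 == 0b10:
        return 0          # 10xxxxxx: continuation byte
    if first_byte >> 5 == 0b110:
        return 2          # 110xxxxx: two-byte sequence
    if first_byte >> 4 == 0b1110:
        return 3          # 1110xxxx: three-byte sequence
    if first_byte >> 3 == 0b11110:
        return 4          # 11110xxx: four-byte sequence
    return None

for ch in ['A', '\u00ff', '\u2126', '\U0001F600']:
    data = ch.encode('utf-8')
    bits = [format(b, '08b') for b in data]
    print(ascii(ch), bits, utf8_sequence_length(data[0]))
```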


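So the probability of hitting invalid byte sequences is high if you decode arbitrary bytes as UTF-8, which is probably what happens when you read a JPEG header. With errors='replace', each invalid byte is silently replaced during decoding, and that is where the data is lost. The same bytes happen to be valid as Latin-1, which is why that codec appears to work.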
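To illustrate the point, here is a sketch that round-trips every possible byte value through both codecs. The byte string is a stand-in for arbitrary binary data such as a JPEG header; it is an assumption made for the demo, not data from the question.

```python
raw = bytes(range(256))  # stand-in for arbitrary binary data

# Latin-1 maps each byte to the code point of the same value, so nothing is lost.
via_latin1 = raw.decode('latin-1').encode('latin-1')
print(via_latin1 == raw)

# With errors='replace', invalid UTF-8 bytes become U+FFFD on decode,
# and re-encoding U+FFFD cannot restore the original bytes.
text = raw.decode('utf-8', 'replace')
via_utf8 = text.encode('utf-8')
print(via_utf8 == raw)
print(text.count('\ufffd'))  # how many replacement characters were inserted
```

If the file is binary and only needs byte-level edits, one option is to skip decoding and work on the bytes object directly, for example with bytes.replace.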

