
Python problems encoding and decoding in UTF-8

I am using Python 3. I read a file into memory as bytes and assign it to a variable. I then convert the binary data to a string with:

def to_str(bytes_or_str):
  if isinstance(bytes_or_str, bytes):
    value = bytes_or_str.decode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

I do this because I want to edit the file and replace some of its characters using a list I made that contains the first 256 chr() values.

Once the loaded file variable is edited, I then rewrite the file as bytes with:

def to_bytes(bytes_or_str):
  if isinstance(bytes_or_str, str):
    value = bytes_or_str.encode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

It works great as long as I only use ASCII characters. I can use latin-1 instead of utf-8, and that works for the first 256 characters, but beyond that the encoding and decoding methods break. Latin-1 uses a single byte per character and only covers 256 values, which I am guessing is why it works up to but not beyond 256. I would like to use utf-8 because it covers a much broader range of characters, but it fails with my two encode/decode functions above, and data gets lost if I use characters that aren't ASCII. Is this problem caused by the fact that utf-8 uses more than one byte per character from chr(128) upward, or by something else? Do I need to use something like the pack() method to isolate characters that use more than one byte? With this function I can find how many bytes a character takes in UTF-8:

def utf8len(x):
    return len(x.encode('utf-8'))

If the loss of data error in encoding is caused by more than one byte per character, maybe I can use this somehow? Anyone have any other ideas? Thanks for any help.

Also: Let's say I change the character 'Ω' to bytes, which reads as b'\xe2\x84\xa6' in the Python console. How exactly does this work if each byte is displayed as a group of several characters? When I convert a character to bytes, Python displays it as characters and not 0's and 1's. Aren't bytes 0's and 1's? I don't know what Python is doing here.

I made this code to try to explain how it works but I still don't completely understand:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

def bits2string(b=None):
    return ''.join([chr(int(x, 2)) for x in b])

def utf8len(x):
    return len(x.encode('utf-8'))

def latin1len(x):
    return len(x.encode('latin-1'))

char_num = 255
def_char = chr(char_num)

char = def_char
bit = string2bits(char)
char2 = bits2string(bit)

print('\nString:')
print(char2)

print('\nUTF-8 byte Len:')
print(utf8len(char))
# I had to add this next if statement because:
#  LATIN-1 can't encode character '\u0100' in position 0: ordinal not in range(256)
if char_num < 256:
    print('\nLatin-1 byte Len:')
    print(latin1len(char))

print('\nList of Bits:')
for x in bit:
    print(x)

In the # -*- coding comment at the top of the script, I can change the script encoding between utf-8 and latin-1, and I can also change the char_num variable to see what the string of bits is for that character in each encoding. But if char_num is above 255 with latin-1, I get the error: UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)

If I hard code the encoding from utf-8 to latin-1 with:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

Shouldn't this code display the bits of the def_char for latin-1 encoding? How does Python work here?

I think the problem is that a JPEG header stores values that can take any byte value (for example pixel density, marker lengths and so on).

https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format

In Latin-1, every character is exactly one byte, although the ISO/IEC 8859-1 standard does not assign a character to every value from 0 to 255.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1
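As a quick check of the one-byte-per-character point, here is a minimal sketch. It relies on the behaviour of Python's own latin-1 codec, which maps byte value n to code point n for all 256 values, so decoding arbitrary bytes never fails and re-encoding gives the same bytes back. The variable names are only for illustration.

```python
# All 256 possible byte values, in order.
all_bytes = bytes(range(256))

# Python's latin-1 codec maps byte value n to code point n,
# so this decode cannot raise for any input.
text = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in text]

# Encoding the unchanged text again gives back the original bytes.
round_trip = text.encode('latin-1')

print(code_points == list(range(256)))  # True
print(round_trip == all_bytes)          # True
```

This is why the latin-1 version of to_str/to_bytes appears to work: the round trip is lossless as long as no character above chr(255) is put into the string.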

However, UTF-8 is a multibyte encoding. For code points above 127, the first byte has to start with the bits 110 (for two-byte characters), 1110 (for three-byte characters) or 11110 (for four-byte characters). The second, third and fourth bytes each have to start with 10.

https://en.wikipedia.org/wiki/UTF-8
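The bit prefixes described above can be made visible with a small sketch. The helper name encoded_bits is made up for this example. Note that it encodes the character first and then prints the bits of each resulting byte, whereas bin(ord(x)) in the question prints the bits of the Unicode code point, which is the same regardless of encoding. The # -*- coding line only tells Python how to read the source file, so it does not change either result.

```python
def encoded_bits(ch, encoding='utf-8'):
    """Return the bytes of one encoded character as 8-digit bit strings."""
    return [format(b, '08b') for b in ch.encode(encoding)]

# One-, two-, three- and four-byte characters in UTF-8.
for ch in ['A', '\u00ff', '\u2126', '\U0001F600']:
    print(f'U+{ord(ch):04X}', encoded_bits(ch))

# The same character can take a different number of bytes in another encoding.
print(encoded_bits('\u00ff', 'latin-1'))  # ['11111111']
print(encoded_bits('\u00ff', 'utf-8'))    # ['11000011', '10111111']
```

For '\u2126' (the character whose bytes are b'\xe2\x84\xa6') this prints 11100010, 10000100 and 10100110: a 1110 lead byte followed by two 10 continuation bytes. The \xe2 notation in the console is just a hexadecimal way of writing the same eight bits.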

So the probability of hitting invalid bytes or byte sequences is high if you read arbitrary bytes, which is probably what you are doing when you read a JPEG header. It may therefore be only by coincidence that your bytes are valid for Latin-1 and not for UTF-8.
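To illustrate where the data goes missing, here is a minimal sketch using a few bytes of the kind a JFIF file typically starts with. The byte values and the b'ABCD' replacement are only illustrative, not taken from the asker's file. With errors='replace', each invalid byte is decoded to U+FFFD, and the original byte value cannot be recovered when encoding again.

```python
# Illustrative bytes only: start-of-image marker, APP0 marker,
# a two-byte segment length and the "JFIF" identifier.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b'JFIF\x00'

# UTF-8 with 'replace': invalid bytes become U+FFFD, so the round trip is lossy.
as_utf8 = raw.decode('utf-8', 'replace')
lossy = as_utf8.encode('utf-8')

# Latin-1: one character per byte, so the round trip is lossless.
as_latin1 = raw.decode('latin-1')
lossless = as_latin1.encode('latin-1')

# Editing the bytes object directly avoids decoding altogether.
edited = raw.replace(b'JFIF', b'ABCD')

print('\ufffd' in as_utf8)  # True
print(lossy == raw)         # False
print(lossless == raw)      # True
print(edited)
```

If the goal is to change byte values in a binary file rather than text, working on the bytes object (or a bytearray) may be the simpler option, since no encoding is involved at all.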
