Python problems encoding and decoding in UTF-8

I am using Python 3 and am reading a file into memory, assigning its contents to a variable as bytes. I then convert the binary data to a string with:

def to_str(bytes_or_str):
  if isinstance(bytes_or_str, bytes):
    value = bytes_or_str.decode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

The reason I do this is that I want to edit and replace some of the characters in the file using a list I made containing the first 256 chr() values.

Once the loaded file variable is edited, I then rewrite the file as bytes with:

def to_bytes(bytes_or_str):
  if isinstance(bytes_or_str, str):
    value = bytes_or_str.encode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

It works great, as long as I only use ASCII characters. I can use latin-1 instead of utf-8 and it works for the first 256 characters, but beyond that the encoding and decoding methods break. Latin-1 uses a single byte per character and only covers 256 values, which I am guessing is the reason why it works up to but not beyond 256. I would like to use utf-8 because it covers a broader spectrum of characters, but it fails with my two encode/decode methods above, and data gets lost if I use characters that aren't ASCII. Is this problem caused by the fact that utf-8 uses more than one byte for chr(128) and above, or by something else? Do I need to use something like the pack() method to isolate characters that use more than one byte? With this function I can find how many bytes a character takes in UTF-8:

def utf8len(x):
    return len(x.encode('utf-8'))

If the data loss during encoding is caused by characters taking more than one byte, maybe I can use this somehow? Does anyone have any other ideas? Thanks for any help.

Also: let's say I change the character 'Ω' to bytes, which reads as b'\xe2\x84\xa6' in the Python console. How exactly does this work if each character in bytes is itself a set of several characters? When I convert a character to bytes, why does Python display it as characters and not as 0s and 1s? Aren't bytes 0s and 1s? I don't know what Python is doing here.
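As a minimal sketch of what the console is showing here: `\xe2` is not four characters but an escape notation for one byte with the value 0xE2. The variable names below are illustrative only, and the character is written as the escape `\u2126` (OHM SIGN), on the assumption that this is the character whose UTF-8 bytes are e2 84 a6.

```python
# One character, encoded to three bytes in UTF-8.
data = "\u2126".encode("utf-8")

# The repr uses \xNN escapes for bytes that are not printable ASCII.
print(data)            # b'\xe2\x84\xa6'
print(len(data))       # 3

# Iterating over a bytes object yields plain integers in range 0-255.
as_ints = list(data)
print(as_ints)         # [226, 132, 166]

# The same three bytes written out as bits.
as_bits = [format(b, "08b") for b in data]
print(as_bits)         # ['11100010', '10000100', '10100110']
```

So the bytes object really is a sequence of 8-bit values; the backslash form is only how Python chooses to print them.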

I wrote this code to try to work out how it behaves, but I still don't completely understand:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

def bits2string(b=None):
    return ''.join([chr(int(x, 2)) for x in b])

def utf8len(x):
    return len(x.encode('utf-8'))

def latin1len(x):
    return len(x.encode('latin-1'))

char_num = 255
def_char = chr(char_num)

char = def_char
bit = string2bits(char)
char2 = bits2string(bit)

print ('\nString:')
print (char2)

print( '\nUTF-8 byte Len:')
print(utf8len(char))
# I had to add this next if statement because:
#  LATIN-1 can't encode character '\u0100' in position 0: ordinal not in range(256)
if char_num < 256:
    print( '\nLatin-1 byte Len:')
    print(latin1len(char))

print ('\nList of Bits:')
for x in bit:
    print (x)

At the beginning of the code, in the # -*- coding -*- comment, I can change the script encoding between utf-8 and latin-1, and I can also change the char_num variable to see what the string of bits is for that character in each encoding. But if it is above 255 for latin-1, I get the error: UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)

If I hard-code the encoding, changing it from utf-8 to latin-1 with:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

Shouldn't this code display the bits of def_char in the latin-1 encoding? How does Python work here?
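A note on why the output does not change: the coding declaration only tells Python how to decode the bytes of the source file itself. string2bits() calls ord(), which returns the Unicode code point of a character and does not depend on any encoding. To see the bits a character has in a particular encoding, the string has to be encoded explicitly first. A minimal sketch, where `encoded_bits` is a hypothetical helper and not part of the script above:

```python
def encoded_bits(text, encoding):
    """Return the bits of each byte that `text` occupies in `encoding`."""
    return [format(b, "08b") for b in text.encode(encoding)]

ch = chr(255)

# ord() is the code point; it is the same whatever the source coding is.
print(ord(ch))                       # 255

# The encoded bytes differ per encoding.
latin1_bits = encoded_bits(ch, "latin-1")
utf8_bits = encoded_bits(ch, "utf-8")
print(latin1_bits)                   # ['11111111']
print(utf8_bits)                     # ['11000011', '10111111']
```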

I think the problem is that a JPEG header stores values that can take any byte value (for example pixel density, marker lengths, and so on).

https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format

In Latin-1 every character is one byte, but in the ISO/IEC 8859-1 standard not every value between 0 and 255 is assigned a character (the control ranges 0-31 and 127-159 are left undefined).

https://en.wikipedia.org/wiki/ISO/IEC_8859-1
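In practice, Python's 'latin-1' codec is more permissive than the printed standard: it maps each byte value 0-255 to the code point with the same number, so any byte string survives a decode/encode round trip. A small check, with illustrative variable names:

```python
# Every possible byte value, once each.
all_bytes = bytes(range(256))

# Decoding with latin-1 never fails: byte N becomes code point N.
text = all_bytes.decode("latin-1")
code_points = [ord(c) for c in text]

# Encoding back gives exactly the original bytes.
round_tripped = text.encode("latin-1")
print(round_tripped == all_bytes)    # True
```

This is why the latin-1 version of to_str()/to_bytes() appears to work on arbitrary file contents, as long as no character above chr(255) is inserted during editing.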

However, UTF-8 is a multibyte encoding. For code points above 127, the first byte has to start with the bits 110 (for two-byte characters), 1110 (for three-byte characters), or 11110 (for four-byte characters). The second, third, and fourth bytes have to start with 10.

https://en.wikipedia.org/wiki/UTF-8
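The prefix rules can be seen by printing the bits of a few encoded characters. This is only an illustrative sketch; `utf8_bit_patterns` is a hypothetical helper name and the sample characters are arbitrary choices.

```python
def utf8_bit_patterns(ch):
    """Return the UTF-8 bytes of one character as 8-bit strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

one_byte = utf8_bit_patterns("A")              # U+0041
two_bytes = utf8_bit_patterns("\u00e9")        # U+00E9
three_bytes = utf8_bit_patterns("\u2126")      # U+2126
four_bytes = utf8_bit_patterns("\U0001F600")   # U+1F600

print(one_byte)     # ['01000001']
print(two_bytes)    # ['11000011', '10101001']
print(three_bytes)  # ['11100010', '10000100', '10100110']
print(four_bytes)   # ['11110000', '10011111', '10011000', '10000000']
```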

So the probability of getting invalid bytes (or byte sequences) is high if you read arbitrary bytes, and you probably do so by reading a JPEG header. It may therefore be that you happened to get bytes that are valid for Latin-1 but not for UTF-8.
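To make the data loss concrete, here is a hedged sketch using a few bytes of the kind a JFIF file starts with (the sample values and variable names are chosen for illustration, not read from a real file). With errors='replace', every invalid sequence is turned into U+FFFD on decoding, so the original byte values cannot be recovered when encoding again.

```python
# Illustrative sample: a JPEG start-of-image marker, an APP0 marker
# and a two-byte length field.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

# UTF-8 with 'replace': invalid bytes become U+FFFD, which is lossy.
decoded = raw.decode("utf-8", "replace")
lossy = decoded.encode("utf-8")
print("\ufffd" in decoded)   # True
print(lossy == raw)          # False

# Latin-1 maps every byte to one code point, so the round trip is exact.
lossless = raw.decode("latin-1").encode("latin-1")
print(lossless == raw)       # True

# Binary data can also be edited without decoding it at all.
edited = raw.replace(b"\x00\x10", b"\x00\x11")
print(edited)                # b'\xff\xd8\xff\xe0\x00\x11'
```

If the file is binary rather than text, working on the bytes (or a bytearray) directly, as in the last step, avoids the question of encodings altogether.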
