Python encoding and decoding problems with UTF-8

Question

I am using Python 3. I read a file into memory as bytes and assign it to a variable. I then convert the binary data to a string with:

def to_str(bytes_or_str):
  if isinstance(bytes_or_str, bytes):
    value = bytes_or_str.decode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

I do this because I want to edit and replace some of the characters in the file, using a list I made containing the first 256 chr() values.

Once the loaded file variable is edited, I rewrite the file as bytes with:

def to_bytes(bytes_or_str):
  if isinstance(bytes_or_str, str):
    value = bytes_or_str.encode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

It works great as long as I only use ASCII characters. I can use latin-1 instead of utf-8 and it works up to 256 characters, but beyond 256 the encoding and decoding methods break. Latin-1 uses a single byte per character for values up to 255, which I am guessing is why it works up to, but not beyond, 256. I would like to use utf-8 because it covers a broader range of characters, but it fails with my two encode/decode methods above, and data gets lost if I use characters that aren't ASCII. Is this problem caused by the fact that utf-8 uses more than one byte for chr(128) and above, or by something else? Do I need to use something like the pack() method to isolate characters that use more than one byte? With this function I can find how many bytes a character takes in UTF-8:

def utf8len(x):
  return len(x.encode('utf-8'))
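
As a rough illustration of what this function reports, the sketch below runs it on four sample characters. The samples are chosen here for illustration and are not from the original post; they cover the one-, two-, three- and four-byte cases.

```python
def utf8len(x):
    # Number of bytes the string occupies once encoded as UTF-8.
    return len(x.encode('utf-8'))

# Illustrative samples (assumed, not from the post):
# 'A', U+00E9, U+2126 and U+1F600.
samples = ['A', '\u00e9', '\u2126', '\U0001F600']
lengths = [utf8len(ch) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f'U+{ord(ch):04X} -> {n} byte(s)')
```

The length only says how many bytes a character needs. It does not by itself explain data loss, since encoding a str to UTF-8 and decoding it back is lossless.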

If the data loss during encoding is caused by characters taking more than one byte, maybe I can use this somehow? Does anyone have any other ideas? Thanks for any help.

Also: let's say I convert the character 'Ω' to bytes, which reads as b'\xe2\x84\xa6' in the Python console. How exactly does this work if each character in bytes is a set of more characters? When I convert a character to bytes, Python displays it as characters and not 0s and 1s. Aren't bytes 0s and 1s? I don't know what Python is doing here.
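
One way to look at this, sketched under the assumption that the character is U+2126 OHM SIGN (the code point that encodes to the three bytes quoted; the Greek capital omega U+03A9 would encode to b'\xce\xa9' instead): the `\xe2` notation is only how Python prints a byte that has no printable ASCII form. The object itself is a sequence of integers from 0 to 255.

```python
data = '\u2126'.encode('utf-8')  # assumed character: U+2126 OHM SIGN

print(data)        # repr: bytes outside printable ASCII are shown as \x escapes
print(list(data))  # the same three bytes as integers

# Each integer formatted as eight binary digits.
bits = [format(b, '08b') for b in data]
print(bits)

# Printable ASCII bytes are displayed as characters, which is only a
# display convenience: b'A' holds the single integer 65.
print(list(b'A'))
```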

I wrote this code to try to explain how it works, but I still don't completely understand:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

def bits2string(b=None):
    return ''.join([chr(int(x, 2)) for x in b])

def utf8len(x):
    return len(x.encode('utf-8'))

def latin1len(x):
    return len(x.encode('latin-1'))

char_num = 255
def_char = chr(char_num)

char = def_char
bit = string2bits(char)
char2 = bits2string(bit)

print('\nString:')
print(char2)

print('\nUTF-8 byte Len:')
print(utf8len(char))
# I had to add this next if statement because:
#  LATIN-1 can't encode character '\u0100' in position 0: ordinal not in range(256)
if char_num < 256:
    print('\nLatin-1 byte Len:')
    print(latin1len(char))

print('\nList of Bits:')
for x in bit:
    print(x)

At the beginning of the code, in the # comment above, I can change the script encoding between utf-8 and latin-1, and also change the char_num variable to see what the string of bits is for that character in each encoding. But if it is above 255 for latin-1 I get the error: UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)

If I hard-code the encoding from utf-8 to latin-1 with:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

Shouldn't this code display the bits of def_char in latin-1 encoding? How does Python work here?
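
A possible explanation, offered as a sketch rather than a definitive answer: the `# -*- coding: ... -*-` line only tells Python how to decode the source file itself. It does not change what `ord()` returns or what `str.encode()` produces. The `string2bits` function above formats code points, not encoded bytes, so its output is the same under either declaration. To see the bits of an encoding, the string has to be encoded first. The helper name `encoded_bits` below is hypothetical.

```python
def encoded_bits(text, encoding):
    # Encode first, then format each resulting byte as eight binary digits.
    return [format(b, '08b') for b in text.encode(encoding)]

ch = chr(255)  # same value as char_num = 255 in the script above

latin1_bits = encoded_bits(ch, 'latin-1')  # one byte
utf8_bits = encoded_bits(ch, 'utf-8')      # two bytes
codepoint_bits = bin(ord(ch))[2:].zfill(8)  # what string2bits computes

print(latin1_bits)
print(utf8_bits)
print(codepoint_bits)
```

For code points up to 255 the latin-1 byte happens to equal the code point, which is why the original script looks right in that range.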

Answer 1

I think the problem is that a JPEG header stores values that can be any byte value (for example pixel density, marker lengths and so on).

https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format

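In Latin-1 every character is one byte. The ISO/IEC 8859-1 standard does not assign a printable character to every value between 0 and 255, but Python's latin-1 codec maps all 256 byte values to the code points U+0000 to U+00FF, so decoding with it never fails.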
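
For Python specifically, a minimal check of how its latin-1 codec treats the full byte range: every value from 0 to 255 decodes to the code point with the same number, so an arbitrary byte string survives a decode/encode round trip.

```python
all_bytes = bytes(range(256))

# Decoding never raises: byte n becomes the character with code point n.
text = all_bytes.decode('latin-1')
code_points = [ord(c) for c in text]

# Encoding back yields the original bytes unchanged.
round_trip = text.encode('latin-1')

print(code_points == list(range(256)))
print(round_trip == all_bytes)
```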

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

However, UTF-8 is a multibyte encoding. For code points above 127, the first byte has to start with 110 (for two-byte characters), 1110 (for three-byte characters) or 11110 (for four-byte characters). The second, third and fourth bytes have to start with 10.
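
The bit prefixes can be inspected directly. The sketch below prints the UTF-8 bytes of four sample characters as binary strings; the samples and the helper name `utf8_bit_pattern` are chosen here for illustration.

```python
def utf8_bit_pattern(ch):
    # UTF-8 bytes of one character, each shown as eight binary digits.
    return [format(b, '08b') for b in ch.encode('utf-8')]

# Illustrative samples: one-, two-, three- and four-byte characters.
for ch in ['A', '\u00e9', '\u2126', '\U0001F600']:
    print(f'U+{ord(ch):04X}', utf8_bit_pattern(ch))
```

In the output, the first byte of each multibyte character carries the 110, 1110 or 11110 prefix, and every following byte starts with 10.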

https://en.wikipedia.org/wiki/UTF-8

So the probability of getting invalid bytes (or byte sequences) is high if you read arbitrary bytes, and you probably do so by reading a JPEG header. It may therefore be that the bytes you read happened to be valid for Latin-1 but not for UTF-8.
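
A small sketch of that effect, assuming the data starts like a typical JFIF header (the sample bytes below are illustrative, not taken from the asker's file). Decoding with 'replace' substitutes U+FFFD for each invalid byte, and that substitution cannot be undone when encoding back. Two lossless alternatives are shown for comparison: latin-1, and UTF-8 with the standard 'surrogateescape' error handler.

```python
# Illustrative bytes: SOI marker, APP0 marker, segment length, 'JFIF\0'.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b'JFIF\x00'

# 'replace' turns each invalid byte into U+FFFD, so the original is lost.
lossy = raw.decode('utf-8', 'replace').encode('utf-8')

# latin-1 maps every byte to one code point and back.
via_latin1 = raw.decode('latin-1').encode('latin-1')

# surrogateescape smuggles invalid bytes through as lone surrogates.
via_escape = raw.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogateescape')

print(lossy == raw)       # data was changed
print(via_latin1 == raw)  # data preserved
print(via_escape == raw)  # data preserved
```

If the file is binary rather than text, it may be simpler to skip decoding altogether and edit the bytes object directly, for example with `bytes.replace`.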

Answer 1
0 2017-09-10 17:42:39
