How to read a single UTF-8 character from a file in Python?
f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?
I have seen this question but none of the answers address the UTF-8 case.
Example code:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))
This outputs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.
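A likely explanation for the behaviour described above is that the text layer reads and decodes a whole buffered chunk of the file, not just the bytes of the character that was asked for. The following is a minimal sketch of that effect, assuming CPython's io.TextIOWrapper; it uses an in-memory io.BytesIO in place of temp.txt, and the variable names are illustrative only.

```python
import io

data = b'\x41\xD0\xB1\xC0'  # 'A', then 'б' (two bytes), then an invalid lead byte

# Text mode: the wrapper decodes a buffered chunk, so it also meets the stray 0xC0.
text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
try:
    text.read(1)
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode: only the bytes that are explicitly requested get decoded.
raw = io.BytesIO(data)
first = raw.read(1).decode('utf-8')   # one-byte character
second = raw.read(2).decode('utf-8')  # two-byte character

print(text_mode_failed, first, second)
```

Reading in binary mode and decoding only the bytes you asked for avoids touching the neighbouring non-UTF-8 data.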
Here's a character that takes up more than one byte. In text mode, read(1) returns one whole character rather than one byte, whether or not you pass the utf-8 encoding explicitly (assuming the platform's default encoding is UTF-8).
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))
Output:
b'\xe2'
⾀
Even though some of the file is not UTF-8, you can still open the file in text (non-binary) reading mode, skip to the byte you want to read and then read a whole character by running read(1).
This works even if your character isn't at the beginning of the file:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1), '+', f.read(1))
If this does not work for you, please provide an example.
The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.
You have the length of the string, which is likely the byte length as that makes the most sense in a binary file. Read the range of bytes in binary mode and decode it after the fact.
Here's a contrived example of writing a binary file containing a UTF-8 string with its length encoded first. It has a two-byte length followed by the encoded string data, surrounded by 10 bytes of random data on each side.
import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin', 'wb') as f:
    f.write(os.urandom(10))  # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2, 'big'))  # write a two-byte big-endian length
    f.write(encoded)  # write string
    f.write(os.urandom(10))  # 10 more random bytes

with open('sample.bin', 'rb') as f:
    print(f.read())  # show the raw data

# Option 1: Seek to the known offset, read the length, then the string
with open('sample.bin', 'rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2), 'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin', 'rb') as f:
    # read 10 bytes and a big-endian 16-bit value (the string's byte length)
    *other, length = struct.unpack('>10bH', f.read(12))
    result = f.read(length).decode()
    print(result)
Output:
b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
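The seek, read-length, read-bytes, decode steps above can be wrapped in a small helper. This is only a sketch under the same assumptions as the contrived example (a two-byte big-endian byte length directly before the string); the name read_prefixed_string and the short-read checks are illustrative additions, and io.BytesIO stands in for the real file.

```python
import io
import struct

def read_prefixed_string(stream, offset):
    """Read a UTF-8 string stored as <2-byte big-endian byte length><bytes>."""
    stream.seek(offset)
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError('file ended inside the length prefix')
    (length,) = struct.unpack('>H', header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError('file ended inside the string data')
    # Decode only after the complete byte range has been read.
    return payload.decode('utf-8')

# Build the same layout in memory: 10 filler bytes, length, string, 10 filler bytes.
text = 'héllo wörld'
encoded = text.encode('utf-8')
blob = b'\x00' * 10 + len(encoded).to_bytes(2, 'big') + encoded + b'\xff' * 10

result = read_prefixed_string(io.BytesIO(blob), 10)
print(result)
```

Because the decode happens on an exact byte range, the non-UTF-8 bytes around the string are never seen by the decoder.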
If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:
import codecs

with open('sample.bin', 'rb') as f:
    f.seek(12)  # skip the 10 random bytes and the 2-byte length
    c = codecs.getreader('utf8')(f)
    print(c.read(1))
Output:
我
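If you need the binary stream to stay positioned exactly after the character that was read, one alternative is to feed an incremental decoder a single byte at a time until it yields a character. This is a sketch, not the answer's method; the helper name read_utf8_char is hypothetical, and the bytes are the ones from the question's example.

```python
import codecs
import io

def read_utf8_char(stream):
    """Read exactly one UTF-8 character from a binary stream, byte by byte."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: final=True raises if a character was left incomplete,
            # and returns '' if nothing was pending.
            return decoder.decode(b'', final=True)
        char = decoder.decode(byte)
        if char:
            return char

# The bytes from the question: 'A', a two-byte character, then a stray 0xC0.
stream = io.BytesIO(b'\x41\xD0\xB1\xC0')
first = read_utf8_char(stream)
second = read_utf8_char(stream)
position = stream.tell()  # the 0xC0 byte has not been consumed
print(first, second, position)
```

Since no more bytes are consumed than the character needs, ordinary binary reads can continue from the same stream afterwards.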