How do I read a single UTF-8 character from a file in Python?

Question

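f.read(1) returns 1 byte, not one character. The file is binary, but particular ranges in the file are UTF-8 encoded strings, each preceded by its length. There is no newline at the end of the string. How do I read such strings?

I have seen this question, but none of the answers address the UTF-8 case.

Example code: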

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')

with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))

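This outputs:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte

When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told to: the code never asks to read the 0xC0 byte.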
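The behaviour described above is consistent with text mode decoding in chunks rather than byte by byte. The following is a minimal sketch for observing that read-ahead, not a statement about every Python implementation: the scratch file and the use of errors='replace' (so that the invalid 0xC0 byte does not raise) are assumptions made for illustration.

```python
import os
import tempfile

# Write the same four bytes as in the example above to a scratch file.
data = b'\x41\xD0\xB1\xC0'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(data)

# errors='replace' is used only so the invalid 0xC0 byte does not raise here.
with open(path, 'r', encoding='utf-8', errors='replace') as f:
    first = f.read(1)           # one decoded character
    consumed = f.buffer.tell()  # bytes the text layer has pulled from the binary layer

os.remove(path)
print(repr(first), consumed, len(data))
```

If consumed is larger than 1, the text layer has already fetched and decoded bytes beyond the character that was requested, which is where the 0xC0 byte gets hit.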

Answer 1

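Here is a character that takes more than one byte. In binary mode, read(1) returns a single byte. In text mode, read(1) returns the whole character, whether or not you pass utf-8 as the encoding explicitly (assuming the platform default encoding is UTF-8).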

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
    
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))

Output:

b'\xe2'
⾀

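Even if parts of the file are not UTF-8, you can still open it in text (non-binary) mode, skip to the byte where the character starts, and read the whole character by calling read(1). Note that text mode decodes ahead of what you ask for, so invalid bytes shortly after that character can still raise UnicodeDecodeError, as the example in the question shows.

This works even if your character is not at the beginning of the file: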

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))

    
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1),'+', f.read(1))

If this does not work for you, please provide an example.
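For a file that mixes UTF-8 text with arbitrary bytes, another option is to stay in binary mode and work out the sequence length from the first byte. The sketch below is one way to do that, not a standard-library API: read_utf8_char is a hypothetical helper name, and it assumes a buffered binary stream whose read(n) returns n bytes unless the end of the file is reached.

```python
import io

def read_utf8_char(stream):
    """Read exactly one UTF-8 encoded character from a binary stream.

    Returns '' at end of file. Hypothetical helper for illustration.
    """
    lead = stream.read(1)
    if not lead:
        return ''
    b = lead[0]
    if b < 0x80:
        extra = 0   # 0xxxxxxx: ASCII, single byte
    elif 0xC2 <= b <= 0xDF:
        extra = 1   # lead byte of a two-byte sequence
    elif 0xE0 <= b <= 0xEF:
        extra = 2   # lead byte of a three-byte sequence
    elif 0xF0 <= b <= 0xF4:
        extra = 3   # lead byte of a four-byte sequence
    else:
        raise ValueError(f'invalid UTF-8 start byte: {b:#04x}')
    tail = stream.read(extra)
    if len(tail) != extra:
        raise ValueError('truncated UTF-8 sequence')
    # decode() validates the continuation bytes and raises
    # UnicodeDecodeError (a ValueError subclass) if they are malformed.
    return (lead + tail).decode('utf-8')

# Same bytes as in the question: 'A', a two-byte character, then invalid 0xC0.
stream = io.BytesIO(b'\x41\xD0\xB1\xC0')
first = read_utf8_char(stream)
second = read_utf8_char(stream)
print(first, second)
```

Because only the bytes belonging to one character are consumed, the trailing 0xC0 is not touched until the next call, which then raises ValueError.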

Answer 2

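"The file is binary, but particular ranges in the file are UTF-8 encoded strings, each preceded by its length."

You have the length of the string, which is probably the length in bytes, since that makes the most sense in a binary file. Read that range of bytes in binary mode and decode it afterwards. Here is a contrived example that writes a binary file containing a UTF-8 string preceded by its encoded length. It has a two-byte length followed by the encoded string data, with 10 bytes of random data on either side.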

import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin','wb') as f:
    f.write(os.urandom(10))  # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2,'big')) # write a two-byte big-endian length
    f.write(encoded)                        # write string
    f.write(os.urandom(10))                 # 10 more random bytes

with open('sample.bin','rb') as f:
    print(f.read())  # show the raw data

# Option 1: Seeking to the known offset, read the length, then the string
with open('sample.bin','rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2),'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin','rb') as f:
    # read 10 bytes and a big-endian 16-bit value; the last field is the string length
    *other,length = struct.unpack('>10bH',f.read(12))
    result = f.read(length).decode()
    print(result)

Output:

b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
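The same read can be wrapped in a small function that also checks for short reads, which matters if the file might be truncated. This is a sketch under the assumptions of the example above (a two-byte big-endian length counted in bytes, UTF-8 payload); read_length_prefixed and the sample data are hypothetical names introduced for illustration.

```python
import io
import struct

def read_length_prefixed(stream, offset):
    """Read a UTF-8 string stored as a 2-byte big-endian byte length plus payload."""
    stream.seek(offset)
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError('file ends inside the length prefix')
    (length,) = struct.unpack('>H', header)   # unsigned 16-bit, big-endian
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError('file ends inside the string')
    return payload.decode('utf-8')

# Build an in-memory blob with the same layout: 10 bytes, length, string, 10 bytes.
encoded = 'naïve café'.encode('utf-8')
blob = b'\x00' * 10 + len(encoded).to_bytes(2, 'big') + encoded + b'\xff' * 10
text = read_length_prefixed(io.BytesIO(blob), 10)
print(text)
```

Decoding only after the full byte range has been read means the surrounding non-UTF-8 bytes never reach the decoder.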

If you really do need to read UTF-8 characters from a specific byte offset in the file, you can wrap the binary stream in a UTF-8 reader after seeking:

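import codecs

with open('sample.bin','rb') as f:
    f.seek(12)                        # skip 10 random bytes and the 2-byte length
    c = codecs.getreader('utf8')(f)   # wrap the binary stream in a UTF-8 reader
    print(c.read(1))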

Output:

我
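A related approach, sketched here as an alternative rather than as part of the original answer, feeds the stream one byte at a time to an incremental decoder from the codecs module until a complete character comes out. The function name read_char_incremental is hypothetical.

```python
import codecs
import io

def read_char_incremental(stream, encoding='utf-8'):
    """Read one character from a binary stream, consuming only its bytes."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of file: returns '' if nothing is pending, and raises
            # UnicodeDecodeError if the file ends in the middle of a character.
            return decoder.decode(b'', final=True)
        char = decoder.decode(byte)   # '' until the sequence is complete
        if char:
            return char

stream = io.BytesIO('我不'.encode('utf-8'))
first = read_char_incremental(stream)
position = stream.tell()
print(first, position)
```

After the call the stream position sits directly after the character, so binary reads can continue from there without any read-ahead to account for.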


Answer 1
1 2021-05-18 16:18:34

Answer 2
1 Accepted 2021-05-19 03:32:52
