How to read a single UTF-8 character from a file in Python?

Question

f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?

I have seen this question but none of the answers address the UTF-8 case.

Example code:

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')

with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))

The first block prints both bytes; the second block fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte

When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.
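The read-ahead can be observed directly. This is a minimal sketch, assuming CPython's text layer pulls a whole chunk from the underlying binary buffer before decoding; the scratch file name is hypothetical, and the encoding is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical scratch file: 'A' followed by one two-byte UTF-8 character.
path = os.path.join(tempfile.mkdtemp(), 'readahead.bin')
with open(path, 'wb') as f:
    f.write(b'\x41\xd0\xb1')

with open(path, 'r', encoding='utf-8') as f:
    first = f.read(1)           # one decoded character
    consumed = f.buffer.tell()  # bytes the text layer has already taken

print(repr(first), consumed)
```

If the reported byte count is larger than one, the text layer has already consumed (and tried to decode) bytes beyond the character that was asked for, which would explain why a later invalid byte triggers the error.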

Answer 1

Here's a character that takes up more than one byte. In text mode, read(1) returns one whole character rather than one byte, whether or not you pass the utf-8 encoding explicitly (assuming the platform's default encoding is UTF-8).

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
    
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))

Output:

b'\xe2'
⾀

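Even though some of the file is not UTF-8, you can still open the file in text (non-binary) mode, skip to the byte you want to read and then read a whole character by running read(1). This only holds as long as the bytes the text layer decodes are themselves valid UTF-8; an invalid byte, such as the 0xC0 in the question's example, still raises UnicodeDecodeError.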

This works even if your character isn't in the beginning of the file:

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))

with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1), '+', f.read(1))

If this does not work for you please provide an example.

Answer 2

The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.

You have the length of the string, which is most likely the byte length, since that makes the most sense in a binary file. Read that range of bytes in binary mode and decode it afterwards. Here's a contrived example that writes a binary file containing a length-prefixed UTF-8 string: a two-byte length followed by the encoded string data, surrounded by 10 bytes of random data on each side.

import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin','wb') as f:
    f.write(os.urandom(10))  # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2,'big')) # write a two-byte big-endian length
    f.write(encoded)                        # write string
    f.write(os.urandom(10))                 # 10 more random bytes

with open('sample.bin','rb') as f:
    print(f.read())  # show the raw data

# Option 1: Seeking to the known offset, read the length, then the string
with open('sample.bin','rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2),'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin','rb') as f:
    # read 10 unsigned bytes and a big-endian 16-bit length
    *other, length = struct.unpack('>10BH', f.read(12))
    result = f.read(length).decode()
    print(result)

Output:

b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
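The two options above can be folded into a small helper that also checks for short reads. This is a minimal sketch, assuming the same layout as the sample (a two-byte big-endian byte length followed by the payload); the name read_prefixed_string is hypothetical, and io.BytesIO stands in for the real file.

```python
import io
import struct

def read_prefixed_string(stream, offset):
    """Read a UTF-8 string stored as a 2-byte big-endian byte length plus payload."""
    stream.seek(offset)
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError('truncated length prefix')
    (length,) = struct.unpack('>H', header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError('truncated string payload')
    # Decode only after the whole byte range has been read.
    return payload.decode('utf-8')

# Build an in-memory stand-in for the file: 10 filler bytes on each side.
text = '我不喜欢你'
encoded = text.encode('utf-8')
blob = b'\x00' * 10 + struct.pack('>H', len(encoded)) + encoded + b'\xff' * 10

decoded = read_prefixed_string(io.BytesIO(blob), 10)
print(decoded)
```

Because decoding happens only on the exact byte range, the surrounding non-UTF-8 data is never passed to the decoder.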

If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:

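import codecs

with open('sample.bin','rb') as f:
    f.seek(12)                        # skip the 10 random bytes and the 2-byte length
    c = codecs.getreader('utf8')(f)   # wrap the binary stream in a UTF-8 reader
    print(c.read(1))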

Output:

我
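The same idea can be applied to the bytes from the question without reading past the requested character. This is a minimal sketch, assuming an incremental decoder from the codecs module that is fed one byte per read; read_utf8_char is a hypothetical helper name, and io.BytesIO stands in for a file opened in 'rb' mode.

```python
import codecs
import io

def read_utf8_char(stream):
    """Return the next UTF-8 character from a binary stream, or '' at end of data."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    while True:
        byte = stream.read(1)
        if not byte:
            # Flush the decoder; this raises if the stream ended mid-character.
            decoder.decode(b'', final=True)
            return ''
        char = decoder.decode(byte)
        if char:
            return char

# Same bytes as the question: 'A', a two-byte character, then an invalid 0xC0.
stream = io.BytesIO(b'\x41\xd0\xb1\xc0')
chars = [read_utf8_char(stream), read_utf8_char(stream)]
position = stream.tell()
print(chars, position)
```

Since only the bytes that make up each character are consumed, the invalid 0xC0 is left unread in the stream until something explicitly asks for it.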

2 answers

solution1
1 2021-05-18 16:18:34

solution2
1 ACCEPTED 2021-05-19 03:32:52
