Reading UTF-8 strings from a binary file

Question

I have some files that contain a bunch of different kinds of binary data, and I'm writing a module to deal with these files.

Among other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()) followed by the string. Since it's UTF-8, the length in bytes of the string may be greater than stringLength, and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).

How do I read n UTF-8 characters (as distinct from n bytes) from a file, taking the multi-byte properties of UTF-8 into account? I've been googling for half an hour, and all the results I've found are either not relevant or make assumptions that I cannot make.

Answer 1

Given a file object and a number of characters, you can use:

# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)

def readUTF8(f, count):
    """Read `count` UTF-8 characters from file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        if not lead:
            break  # end of file reached before `count` characters were read
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return b''.join(res).decode('utf8')  # bytes join works on Python 2.6+ and 3

Result of a test:

>>> from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO instead
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'
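To tie this back to the format in the question, here is a minimal Python 3 sketch that reads the 2-byte big-endian prefix with struct.unpack() and then reads that many characters using the same lead-byte idea. The names utf8_follow_bytes and read_prefixed_string are made up for this illustration, and the sketch assumes the prefix holds a character count, as the question describes.

```python
import io
import struct


def utf8_follow_bytes(lead):
    """Return how many continuation bytes follow the lead byte value `lead`."""
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return 0  # ASCII, a stray continuation byte, or an invalid lead byte


def read_prefixed_string(f):
    """Read a 2-byte big-endian character count, then that many UTF-8 characters.

    Hypothetical helper: assumes the prefix counts characters, not bytes.
    """
    header = f.read(2)
    if len(header) != 2:
        raise EOFError("truncated length prefix")
    (count,) = struct.unpack(">H", header)
    chunks = []
    for _ in range(count):
        lead = f.read(1)
        if not lead:
            raise EOFError("truncated string")
        chunks.append(lead)
        extra = utf8_follow_bytes(lead[0])  # indexing bytes gives an int on Python 3
        if extra:
            chunks.append(f.read(extra))
    # Validation of the byte sequence is left to .decode()
    return b"".join(chunks).decode("utf-8")


# Build a sample record: prefix, string, then two unrelated binary bytes.
text = "h\u00e9llo \ua000"
payload = struct.pack(">H", len(text)) + text.encode("utf-8") + b"\x01\x02"
stream = io.BytesIO(payload)
result = read_prefixed_string(stream)
rest = stream.read()  # the bytes after the string are left untouched
```

Because only the bytes belonging to the string are consumed, whatever binary data follows it can still be parsed from the same file object.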

Answer 2

One character in UTF-8 can occupy 1, 2, 3, or 4 bytes.

If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules: http://en.wikipedia.org/wiki/UTF-8
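If you would rather not encode those rules by hand, one option is to let Python's own UTF-8 codec apply them. The sketch below (Python 3, with a made-up function name read_utf8_chars) feeds one byte at a time to an incremental decoder from the standard codecs module and stops once the requested number of characters has been produced.

```python
import codecs
import io


def read_utf8_chars(f, count):
    """Read `count` characters from binary file `f`, one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict error handling
    chars = []
    while len(chars) < count:
        byte = f.read(1)
        if not byte:
            # End of file: final=True raises if a character was cut off.
            decoder.decode(b"", final=True)
            break
        piece = decoder.decode(byte)  # "" until a whole character is available
        if piece:
            chars.append(piece)
    return "".join(chars)


# Five characters of 1, 2, 3, 4 and 1 bytes, followed by an unrelated byte.
stream = io.BytesIO("a\u00e9\ua000\U0001f600z".encode("utf-8") + b"\xff")
decoded = read_utf8_chars(stream, 5)
leftover = stream.read()
```

Reading a single byte per call is slow for large strings, but it never consumes bytes past the last requested character, which matters when other binary data follows.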

Most of the time, you can just set the encoding to utf-8 and read the input stream.

You do not need to care how many bytes you have read.
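As a rough illustration of that approach in Python 3: a text-mode stream with its encoding set to utf-8 counts characters rather than bytes in read(). The example wraps an in-memory io.BytesIO in io.TextIOWrapper purely for demonstration; the sample data is made up.

```python
import io

# Sample data: "naïve ꀀ text" encoded as UTF-8 (some characters take 2 or 3 bytes).
raw = io.BytesIO("na\u00efve \ua000 text".encode("utf-8"))
reader = io.TextIOWrapper(raw, encoding="utf-8")

first = reader.read(7)  # read(n) on a text stream returns up to n characters
rest = reader.read()    # the remainder of the stream, also as text
```

Note that a text wrapper reads ahead from the underlying byte stream in chunks, so this fits files that are text throughout. For a file that mixes strings with other binary fields, as in the question, the underlying position after read() is not reliable.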

Answer 1
5 accepted 2013-03-04 11:23:19

Answer 2
0 2013-03-04 10:51:30
