Python read file UTF-8 decode issue

I am running into an issue reading a file that contains both UTF-8 and ASCII characters. The problem is that I use seek to read only part of the data, and I have no way of knowing whether I am starting to read in the middle of a UTF-8 character.

  • OS X
  • Python 3.6.6

To simplify, my issue can be demonstrated with the following code.

# write some utf-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine. to just demo I can read the file as whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seek 3 by 3
data.seek(3)
data.read(1) # this works fine. 

I know I can open the file in binary mode and then read from any position without issue. However, I need to process the data as a string, so I end up with the same issue when I decode it.

data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)  # read() returns bytes; seek() only returns the new position
z.decode() # will hit same error 

Without seek, I can read it correctly, even when just calling read(1).

data = open('/tmp/test.txt')
data.tell() # 0
data.read(1) 
data.tell() # shows 3 even calling read(1)

One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, set position = position - 1 and seek(position) again, repeating until the read succeeds.

Is there a better (right) way to handle it?

As the documentation explains, when you seek on text files:

offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
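For contrast, here is a minimal sketch of seeking the documented way in text mode: record the values that tell() returns at line starts, then seek only to those. The helper name index_line_offsets and the temporary file are assumptions made for illustration, not part of the question.

```python
import os
import tempfile

def index_line_offsets(f):
    """Return the tell() value at the start of every line of text file f."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()       # opaque number; only meaningful for seek() on f
        line = f.readline()  # readline() rather than iteration, which disables tell()
        if not line:
            break
        offsets.append(pos)
    return offsets

# Demo on a throwaway file (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as out:
    out.write('0:\u3039\u5ba0\u8707\n1:\u3039\u5ba0\u8707\n2:abc\n')

with open(path, encoding='utf-8') as f:
    offsets = index_line_offsets(f)
    f.seek(offsets[1])       # legal: this value came from tell()
    second_line = f.readline()

os.remove(path)
```

This only helps when you can afford one pass over the file to build the index; it does not allow jumping to an arbitrary byte offset.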

In practice, what seek(1) actually does is seek 1 byte into the file, which puts it in the middle of a character. So, what ends up happening is similar to this:

>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:

>>> b[3:].decode()
'宠蜇\n'
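To see which byte offsets in this example are safe to decode from, one can classify each byte. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a character. This is only an illustrative sketch; the helper name is_continuation is made up here.

```python
def is_continuation(byte):
    """True if byte (an int 0..255) is a UTF-8 continuation byte (10xxxxxx)."""
    return byte & 0xC0 == 0x80

s = chr(12345) + chr(23456) + chr(34567) + '\n'
b = s.encode('utf-8')

# Offsets at which a character begins, i.e. where decoding can safely start.
start_offsets = [i for i, byte in enumerate(b) if not is_continuation(byte)]
print(start_offsets)  # [0, 3, 6, 9]
```

Offsets 1 and 2 fall on continuation bytes, which is why seek(1) fails while seek(3) happens to work.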

If you want to rely on that undocumented behavior to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:

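def readchar(f, pos):
    # f is a text-mode file; try successive byte offsets until one decodes
    for i in range(pos, pos+5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    # UnicodeDecodeError requires five arguments, so raise ValueError instead
    raise ValueError('Unable to find a UTF-8 start byte')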

Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:

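def readchar(f, pos):
    # f is a binary-mode file; returns the first start byte at or after pos
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        if not byte:
            break  # reached end of file
        # read(1) returns a bytes object, so compare its integer value;
        # anything outside 0x80..0xBF is not a continuation byte
        if byte[0] < 0x80 or byte[0] >= 0xC0:
            return byte
    raise ValueError('Unable to find a UTF-8 start byte')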
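A variation on the same idea, sketched here under the assumption that the file is valid UTF-8: instead of scanning forward, back up to the start byte of the character that contains the offset, work out the character's length from that byte, and decode exactly those bytes. The name read_char_at is hypothetical.

```python
import io

def read_char_at(f, pos):
    """Decode the UTF-8 character containing byte offset pos of binary file f."""
    # A character is at most 4 bytes long, so the start byte is at most
    # 3 bytes before pos.
    for start in range(pos, max(pos - 4, -1), -1):
        f.seek(start)
        byte = f.read(1)
        if not byte:
            raise ValueError('offset is past the end of the file')
        if byte[0] & 0xC0 != 0x80:  # not a continuation byte
            break
    else:
        raise ValueError('no UTF-8 start byte found near offset')

    first = byte[0]
    if first < 0x80:
        length = 1   # 0xxxxxxx: ASCII
    elif first >= 0xF0:
        length = 4   # 11110xxx
    elif first >= 0xE0:
        length = 3   # 1110xxxx
    else:
        length = 2   # 110xxxxx
    f.seek(start)
    return f.read(length).decode('utf-8')

data = (chr(12345) + chr(23456) + chr(34567) + '\n').encode('utf-8')
f = io.BytesIO(data)
print(read_char_at(f, 1) == chr(12345))  # True: offset 1 is inside the first character
```

If the bytes are not valid UTF-8 after all, the final decode() still raises UnicodeDecodeError, which is probably what you want.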

However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.

In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII; that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
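A quick way to convince yourself of this property (an illustrative check, not a proof): every byte of a multi-byte character has its high bit set, so the byte 0x0A can only ever be a real newline.

```python
text = '0:\u3039\u5ba0\u8707\r\n'
data = text.encode('utf-8')

# The only 0x0A byte is the actual newline at the end of the line.
newline_offsets = [i for i, byte in enumerate(data) if byte == 0x0A]

# Every byte of the three multi-byte characters is 0x80 or above.
multibyte = '\u3039\u5ba0\u8707'.encode('utf-8')
all_high = all(byte >= 0x80 for byte in multibyte)

print(newline_offsets, all_high)  # [12] True
```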

So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.

The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a byte at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 bytes at a time, but this is hopefully simpler to understand:

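def getline(f, pos, maxpos):
    # f is a binary-mode file; maxpos is unused in this simple version
    # walk backward from pos-1 looking for the previous newline byte
    for start in range(pos-1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break  # the file position is now just past that newline
    else:
        f.seek(0)  # no earlier newline: the line starts at the beginning
    return f.readline().decode()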

Here it is in action:

>>> import io
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇
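For larger files, the chunked variant mentioned earlier (back up, read a block, scan it with rfind) might look roughly like the sketch below. The name getline_chunked and the chunk_size parameter are assumptions for illustration; the line-selection rule is the same as in getline above.

```python
import io

def getline_chunked(f, pos, chunk_size=80):
    """Return the decoded line containing byte offset pos of binary file f."""
    end = pos          # search for a newline in bytes [0, pos), newest first
    line_start = 0     # default: no earlier newline, so the line starts at 0
    while end > 0:
        begin = max(0, end - chunk_size)
        f.seek(begin)
        chunk = f.read(end - begin)
        idx = chunk.rfind(b'\n')
        if idx != -1:
            line_start = begin + idx + 1  # first byte after that newline
            break
        end = begin    # nothing in this chunk; move one chunk further back
    f.seek(line_start)
    return f.readline().decode('utf-8')

data = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5)).encode('utf-8')
f = io.BytesIO(data)
print(getline_chunked(f, 12), end='')  # same line that getline(f, 12, ...) returns
```

Because only newline bytes are searched for, a chunk boundary that splits a multi-byte character does no harm; decoding happens only on the complete line.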
