Python: UTF-8 decoding problem when reading a file

Question

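I am having trouble reading a file that contains both UTF-8 and ASCII characters. The problem is that I use seek to read only part of the data, and I have no way of knowing whether I have landed in the "middle" of a UTF-8 character.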

Environment
Python 3.6.6

Put simply, my problem can be demonstrated with the following code.

# write some UTF-8 text to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine; it just shows I can read the file as a whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read if I seek 3 bytes at a time
data.seek(3)
data.read(1) # this works fine.

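I know I can open the file in binary mode and then seek to any position and read without a problem. However, I need to work with strings, so I hit the same problem when decoding the bytes to a string.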

data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)
z.decode() # will hit the same error

Without seek, I can read it correctly, even when calling read(1).

data = open('/tmp/test.txt')
data.tell() # 0
data.read(1) 
data.tell() # shows 3 even calling read(1)

One thing I can think of is this: after seeking to a position, try to read, and on UnicodeDecodeError set position = position - 1 and seek(position) again, until the read succeeds.

Is there a better (correct) way to handle this?

Answer 1

As the documentation explains, when you seek in a text file:

offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
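For comparison, here is a minimal sketch of the documented way to seek in a text file: record the values that tell() returns and only ever seek back to those. The helper name line_offsets and the temporary file are made up for this illustration; the file content reuses the characters from the question.

```python
import os
import tempfile

def line_offsets(f):
    """Collect the tell() value at the start of each line of an open text file."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()          # opaque value, only meant to be passed to seek()
        if not f.readline():    # empty string means end of file
            return offsets
        offsets.append(pos)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\u3039\u5ba0\n\u8707\nabc\n')
    with open(path, encoding='utf-8') as f:
        offsets = line_offsets(f)
        f.seek(offsets[1])      # legal: this value came from tell()
        second_line = f.readline()
```

The loop uses readline() rather than iterating over the file, because tell() is disabled on a text file while it is being iterated with a for loop.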

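In practice, what seek(1) actually does is move 1 byte into the file, which puts it in the middle of a character. So what ends up happening is similar to this: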

>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So, although it is not legal, seek(3) does work, because you happen to be seeking to the start of a character. It is equivalent to:

>>> b[3:].decode()
'宠蜇\n'
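To see why offset 3 works and offset 1 does not, it can help to list which byte offsets begin a character. This is only a small sketch; the helper name char_start_offsets is invented here, and it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def char_start_offsets(data):
    """Return the byte offsets in `data` at which a UTF-8 character starts."""
    # A continuation byte looks like 0b10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    # Anything else is either ASCII or the lead byte of a longer sequence.
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]

b = (chr(12345) + chr(23456) + chr(34567) + '\n').encode('utf-8')
starts = char_start_offsets(b)
print(starts)  # [0, 3, 6, 9]
```

Offsets 1 and 2 are missing from the list because those bytes are continuation bytes of the first character.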

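If you want to rely on this undocumented behavior to seek to arbitrary points in the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example: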

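def readchar(f, pos):
    # Try successive offsets until one lands on the start of a character.
    for i in range(pos, pos+5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    # UnicodeDecodeError itself needs five constructor arguments, so raise its
    # base class ValueError with a plain message instead.
    raise ValueError('Unable to find a UTF-8 start byte')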

Or you can use your knowledge of the UTF-8 encoding to manually scan a binary file for a valid start byte:

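def readchar(f, pos):
    # f must be opened in binary mode; returns the start byte that was found.
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        if not byte:
            break  # end of file
        # 0x00-0x7F are ASCII and 0xC0-0xFF are lead bytes; 0x80-0xBF are
        # continuation bytes. byte[0] is the int value of the one-byte read.
        if byte[0] < 0x80 or byte[0] >= 0xC0:
            return byte
    raise ValueError('Unable to find a UTF-8 start byte')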
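The same idea can be turned around to match what the question proposed, stepping backward to the start of the character and then decoding all of it. The following is a sketch under the assumption that the file is valid UTF-8 and opened in binary mode; read_char_at is a hypothetical name, not part of any library.

```python
import io

def read_char_at(f, pos):
    """Return (start_offset, character) for the character covering byte `pos`."""
    start = pos
    f.seek(start)
    byte = f.read(1)
    # Step back while sitting on a continuation byte (0b10xxxxxx). A valid
    # character has at most three of them, so never back up further than that.
    while byte and (byte[0] & 0xC0) == 0x80 and start > 0 and pos - start < 3:
        start -= 1
        f.seek(start)
        byte = f.read(1)
    if not byte:
        raise ValueError('position is at or past the end of the file')
    first = byte[0]
    # The lead byte tells us how long the whole sequence is.
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError('no UTF-8 start byte found near offset %d' % pos)
    return start, (byte + f.read(length - 1)).decode('utf-8')

f = io.BytesIO((chr(12345) + chr(23456) + chr(34567) + '\n').encode('utf-8'))
result = read_char_at(f, 1)  # start offset 0, character U+3039
```

Returning the start offset along with the character lets the caller continue reading from a known character boundary.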

However, if what you are actually looking for is the next complete line before or after some arbitrary point, that is a lot easier.

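In UTF-8, a newline is encoded as a single byte, the same as in ASCII; that is, '\n' encodes to b'\n'. (If you have Windows-style line endings, the same is true of the carriage return, so '\r\n' also encodes to b'\r\n'.) Every byte of a multi-byte character is 0x80 or higher, so the byte b'\n' can never appear inside another character. This is by design, to make problems like this easier to deal with.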
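A quick way to convince yourself of this is to split the encoded bytes on the newline byte and check that you get the same lines as splitting the text. The sample string below is invented for the illustration.

```python
text = 'a\u3039\u5ba0\nb\u8707\r\nlast'
raw = text.encode('utf-8')

# Split the raw bytes on the newline byte, then decode each piece...
lines_from_bytes = [piece.decode('utf-8') for piece in raw.split(b'\n')]
# ...and compare with splitting the decoded text directly.
lines_from_text = text.split('\n')
print(lines_from_bytes == lines_from_text)  # True

# Every byte of a multi-byte character is 0x80 or above, so none can be b'\n'.
multibyte = '\u3039\u5ba0\u8707'.encode('utf-8')
print(all(byte >= 0x80 for byte in multibyte))  # True
```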

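So, if you open the file in binary mode, you can seek forward or backward until you find a newline. Then you can just use the (binary file) readline method to read from there up to the next newline.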

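The exact details depend on which rules you want to apply here. Also, I am going to show a dumb, completely unoptimized version that reads one byte at a time. In real life you would probably want to back up, read, and scan (for example with rfind) a chunk at a time, say 80 bytes, but this is hopefully easier to understand: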

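def getline(f, pos, maxpos):
    # f is a binary file; maxpos is not used by this simple version.
    # Walk backward from pos-1 looking for the newline that ends the previous line.
    for start in range(pos-1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break  # the file position is now just past that newline
    else:
        f.seek(0)  # no earlier newline, so the line starts at the beginning
    return f.readline().decode()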

Here it is in action:

>>> import io
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇
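For reference, the chunked variant mentioned above could look roughly like this. It is a sketch rather than a tuned implementation: getline_chunked and its chunk_size parameter are names made up here, and it follows the same rule as getline (the line starts just after the last newline found strictly before pos).

```python
import io

def getline_chunked(f, pos, chunk_size=80):
    """Return the decoded line containing byte offset `pos` of binary file `f`."""
    end = pos
    start = 0  # used if no earlier newline exists: the line starts at offset 0
    while end > 0:
        begin = max(0, end - chunk_size)
        f.seek(begin)
        chunk = f.read(end - begin)
        idx = chunk.rfind(b'\n')      # last newline inside this chunk, if any
        if idx != -1:
            start = begin + idx + 1   # the line starts just after that newline
            break
        end = begin                   # nothing found, look at the previous chunk
    f.seek(start)
    return f.readline().decode('utf-8')

s = ''.join('%d:\u3039\u5ba0\u8707\n' % i for i in range(5))
f = io.BytesIO(s.encode('utf-8'))
first = getline_chunked(f, 11)
second = getline_chunked(f, 12)
last = getline_chunked(f, 59, chunk_size=5)
```

Because the scan only ever looks for the single byte b'\n', it does not matter that a chunk boundary may fall in the middle of a multi-byte character; decoding happens only on the complete line.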
