Speeding up reading of a compressed bz2 file ('rb' mode)

Question

I have a BZ2 file of more than 10 GB. I'd like to read it without decompressing it into a temporary file (which would be more than 50 GB).

With this method:

import bz2, time
t0 = time.time()
time.sleep(0.001)  # to avoid division by zero
with bz2.open(r"F:\test.bz2", 'rb') as f:  # raw string, otherwise "\t" is read as a tab
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))

I can only read ~250k lines per second. On a similar file, decompressed first, I get about 3M lines per second, i.e. roughly a factor of 10:

with open(r"F:\test.txt", 'rb') as f:
    ...  # same loop as above

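I don't think this is only due to the intrinsic CPU cost of decompression (the total time to decompress into a temporary file plus the time to read that uncompressed file is far smaller than the time taken by the method described here). It may also be due to a lack of buffering, or to something else. Are there faster Python implementations of bz2.open?

How can I speed up reading a BZ2 file in binary mode while looping over its "lines" (separated by \n)?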

Note: currently (time to decompress test.bz2 into test.tmp) + (time to iterate over lines of test.tmp) is far smaller than (time to iterate over lines of bz2.open('test.bz2')), which probably should not be the case.
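The comparison in the note can be reproduced with a small harness. This is only a sketch: the helper names count_lines_direct, count_lines_via_temp and timed are made up for illustration, and the temporary file is produced in Python with shutil.copyfileobj in 1 MB chunks rather than with an external tool, so the numbers will not match the ones quoted above exactly.

```python
import bz2
import os
import shutil
import tempfile
import time


def count_lines_direct(bz2_path):
    """Iterate over the lines of the compressed file with bz2.open."""
    with bz2.open(bz2_path, 'rb') as f:
        return sum(1 for _ in f)


def count_lines_via_temp(bz2_path):
    """Decompress to a temporary file in large chunks, then iterate over it."""
    fd, tmp_path = tempfile.mkstemp(suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as dst, bz2.open(bz2_path, 'rb') as src:
            shutil.copyfileobj(src, dst, 1024 * 1024)
        with open(tmp_path, 'rb') as f:
            return sum(1 for _ in f)
    finally:
        os.remove(tmp_path)


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling timed(count_lines_direct, path) and timed(count_lines_via_temp, path) on the same file gives the two durations to compare. Both calls should return the same line count.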

Linked topic: https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869

Answer 1

You can use BZ2Decompressor to deal with huge files. It decompresses blocks of data incrementally, out of the box:

import bz2, time

t0 = time.time()
time.sleep(0.000001)  # to avoid division by zero
with open('temp.bz2', 'rb') as fi:
    decomp = bz2.BZ2Decompressor()
    residue = b''
    total_lines = 0
    for data in iter(lambda: fi.read(100 * 1024), b''):
        # prepend the leftover of the previous block to the newly decompressed data
        raw = residue + decomp.decompress(data)
        current_block = raw.split(b'\n')
        # the last element is an incomplete line, or b'' if raw ends with b'\n'
        # (or is empty); either way it is carried over to the next iteration
        residue = current_block.pop()
        # process_data(current_block) => do the processing of the current data block
        total_lines += len(current_block)
        print('%i lines/sec' % (total_lines / (time.time() - t0)))
    # process_data(residue) => now finish processing the last line
    if residue:  # only count a final line that has no trailing newline
        total_lines += 1
    print('Final: %i lines/sec' % (total_lines / (time.time() - t0)))

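Here I read a chunk of the binary file, feed it to the decompressor and receive a chunk of decompressed data. Be aware that the decompressed chunks have to be concatenated to restore the original data, because a line can straddle two chunks. This is why the last entry of each chunk needs special treatment.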
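The same idea can be packaged as a generator, so that the caller simply loops over lines. This is a sketch under stated assumptions: the name iter_bz2_lines and the 1 MB default chunk size are illustrative choices, lines are yielded without their trailing newline, and only the first stream of a multi-stream archive is read, because a single BZ2Decompressor handles one stream.

```python
import bz2


def iter_bz2_lines(path, chunk_size=1024 * 1024):
    """Yield the lines of a single-stream bz2 file, without the line terminator."""
    decomp = bz2.BZ2Decompressor()
    residue = b''
    with open(path, 'rb') as fi:
        while not decomp.eof:
            data = fi.read(chunk_size)
            if not data:
                break
            # Join the carried-over partial line with the new data, then split.
            lines = (residue + decomp.decompress(data)).split(b'\n')
            # The last piece is incomplete (or empty): keep it for the next round.
            residue = lines.pop()
            yield from lines
    if residue:
        # Final line of a file that does not end with a newline.
        yield residue
```

Usage would look like `for line in iter_bz2_lines('temp.bz2'): ...`. Whether this beats the loop above depends on the per-line work done by the caller.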

In my experiments it runs a little faster than your solution with io.BytesIO(). bz2 is known to be slow, so if that bothers you, consider migrating to snappy or zstandard.

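Regarding the time it takes to process bz2 in Python: it might be fastest to decompress the file into a temporary file with a Linux utility and then process the plain text file. Otherwise you depend on Python's implementation of bz2.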
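A variation on this, for when a 50 GB temporary file is not acceptable, is to let the external bzip2 program decompress to a pipe and iterate over its output. The sketch below is not what the paragraph above describes (it uses no temporary file). It assumes a bzip2 executable is available on PATH, and the function name iter_lines_with_bzip2 is made up for illustration.

```python
import shutil
import subprocess


def iter_lines_with_bzip2(path, exe='bzip2'):
    """Yield lines (terminator included) decompressed by an external bzip2 process."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(exe + ' not found on PATH')
    # -d: decompress, -c: write the result to standard output
    with subprocess.Popen([exe, '-dc', path], stdout=subprocess.PIPE) as proc:
        yield from proc.stdout
    if proc.returncode != 0:
        raise RuntimeError('%s exited with status %d' % (exe, proc.returncode))
```

Decompression then runs in a separate process, so it can overlap with the line processing done in Python. Whether it is actually faster on a given machine has to be measured.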

Answer 2

This method already gives a 2x improvement over the native bz2.open.

import bz2, time, io

def chunked_readlines(f):
    s = io.BytesIO()
    while True:
        buf = f.read(1024*1024)
        if not buf:
            last = s.getvalue()
            if last:
                yield last  # final line; a generator's return value is lost to a for loop
            return
        s.write(buf)
        s.seek(0)
        L = s.readlines()
        yield from L[:-1]
        s = io.BytesIO()
        s.write(L[-1])  # very important: the last line read in the 1 MB chunk might be
                        # incomplete, so we keep it to be processed in the next iteration.
                        # readlines() splits on b'\n' only, so a chunk that stops in the
                        # middle of a \r\n is fine: the b'\r' is carried over as well

t0 = time.time()
i = 0
with bz2.open(r"D:\test.bz2", 'rb') as f:  # raw string, otherwise "\t" is read as a tab
    for l in chunked_readlines(f):       # 500k lines per second
    # for l in f:                        # 250k lines per second
        i += 1
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))

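It is probably possible to do even better.

We could get a 4x improvement if we could use s as a simple bytes object instead of io.BytesIO. Unfortunately, in that case splitlines() does not behave as expected: bytes.splitlines() also breaks lines on \r, whereas iterating over a file opened in binary mode breaks on \n only, so the two give different results.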
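The difference is easy to see on a small input, and splitting on b'\n' explicitly avoids it. The following is a sketch, not a measured 4x solution: the name chunked_lines is illustrative, and unlike file iteration it yields lines without their trailing newline.

```python
import io

data = b'a\rb\nc\n'

# bytes.splitlines() treats \r, \n and \r\n all as line boundaries ...
print(data.splitlines(keepends=True))   # [b'a\r', b'b\n', b'c\n']
# ... whereas iterating over a binary file object splits on \n only.
print(list(io.BytesIO(data)))           # [b'a\rb\n', b'c\n']


def chunked_lines(f, chunk_size=1024 * 1024):
    """Yield lines split on b'\\n' only, using plain bytes objects."""
    tail = b''
    while True:
        buf = f.read(chunk_size)
        if not buf:
            if tail:
                yield tail  # last line without a trailing newline
            return
        parts = (tail + buf).split(b'\n')
        tail = parts.pop()  # possibly incomplete line, kept for the next chunk
        yield from parts
```

It can be dropped in where chunked_readlines(f) is used above, as long as the consumer does not rely on the b'\n' at the end of each line.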
