How to get the time needed for decompressing large bz2 files?
I need to process a large bz2 file (~6 GB) in Python by decompressing it line by line with BZ2File.readline(). The problem is that I want to know how long processing the whole file will take.
I did a lot of searching, trying to get the actual size of the decompressed file so that I could report the percentage processed and the remaining time on the fly, only to find that it seems impossible to know the decompressed size without decompressing the file first ( https://stackoverflow.com/a/12647847/7876675 ).
Besides the large amount of memory that decompressing the whole file would require, decompression itself also takes a lot of time. So, can anyone help me estimate the remaining processing time on the fly?
You can estimate the remaining time based on the consumption of compressed data instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)
You can easily find the size of the compressed file, and use the time spent so far on the compressed data to estimate the time needed to process the remaining compressed data.
Here is a simple example (Python 3, taking the file name from the command line) that uses a BZ2Decompressor object to operate on the input one chunk at a time, showing progress as it reads:
```python
# Decompress a bzip2 file, showing progress based on consumed input.
import sys
import os
import bz2
import time

def proc(input):
    """Decompress and process a piece of a compressed stream."""
    dat = dec.decompress(input)
    got = len(dat)
    if got != 0:    # 0 is common -- waiting for a bzip2 block
        # process dat here
        pass
    return got

# Get the size of the compressed bzip2 file.
path = sys.argv[1]
size = os.path.getsize(path)

# Decompress CHUNK bytes at a time.
CHUNK = 16384
totin = 0
totout = 0
prev = -1
dec = bz2.BZ2Decompressor()
start = time.time()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(CHUNK), b''):
        # Feed the chunk to the decompressor.
        got = proc(chunk)

        # Handle the case of concatenated bz2 streams.
        if dec.eof:
            rem = dec.unused_data
            dec = bz2.BZ2Decompressor()
            got += proc(rem)

        # Show progress.
        totin += len(chunk)
        totout += got
        if got != 0:    # only if a bzip2 block was emitted
            frac = round(1000 * totin / size)
            if frac != prev:
                left = (size / totin - 1) * (time.time() - start)
                print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                prev = frac

# Show the resulting size.
print(end='\r')
print(totout, 'uncompressed bytes')
```
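The concatenated-streams branch matters because a .bz2 file may contain several bzip2 streams back to back; BZ2Decompressor stops at the end of the first one and exposes the rest via unused_data. A minimal self-contained check of that behavior (the sample data here is made up for illustration):

```python
import bz2

# Two bzip2 streams concatenated, as produced e.g. by `bzip2` run twice
# and the outputs joined together.
data = bz2.compress(b'first stream ') + bz2.compress(b'second stream')

dec = bz2.BZ2Decompressor()
out = dec.decompress(data)
print(dec.eof)          # True: the first stream is finished
rem = dec.unused_data   # compressed bytes belonging to the second stream

# A fresh decompressor is needed for the next stream.
dec = bz2.BZ2Decompressor()
out += dec.decompress(rem)
print(out)              # b'first stream second stream'
```

This is exactly why the loop above creates a new BZ2Decompressor and re-feeds unused_data whenever dec.eof becomes true.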
You can also use the existing high-level API provided by Python's bz2 module directly, while getting information about how much compressed data has been processed from the underlying file handle:
```python
import bz2
import datetime
import io
import time

with bz2.open(input_filename, 'rt', encoding='utf8') as input_file:
    # Reach down to the raw file object to track the compressed position.
    underlying_file = input_file.buffer._buffer.raw._fp
    underlying_file.seek(0, io.SEEK_END)
    underlying_file_size = underlying_file.tell()
    underlying_file.seek(0, io.SEEK_SET)

    lines_count = 0
    start_time = time.perf_counter()
    progress = f'{0:.2f}%'
    while True:
        line = input_file.readline().strip()
        if not line:
            break
        process_line(line)
        lines_count += 1
        current_position = underlying_file.tell()
        new_progress = f'{current_position / underlying_file_size * 100:.2f}%'
        if progress != new_progress:
            progress = new_progress
            current_time = time.perf_counter()
            elapsed_time = current_time - start_time
            elapsed = datetime.timedelta(seconds=elapsed_time)
            remaining = datetime.timedelta(
                seconds=(underlying_file_size / current_position - 1) * elapsed_time)
            print(f"{lines_count} lines processed, {progress}, "
                  f"{elapsed} elapsed, {remaining} remaining")
```
If you are reading a binary file rather than a text file, you must use:
```python
with bz2.open(input_filename, 'rb') as input_file:
    underlying_file = input_file._buffer.raw._fp
    ...
```
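Note that the attribute chains above (`buffer._buffer.raw._fp`) are private CPython implementation details and may change between versions. A hedged alternative sketch that avoids them: open the file yourself, pass the file object to bz2.open(), and call tell() on your own handle. Because of internal read buffering, the position runs slightly ahead of what has actually been decompressed, but it is close enough for a progress estimate. The sample file written below is an assumption for illustration only.

```python
import bz2
import os

# Create a small sample file so the sketch is self-contained (in real use,
# `path` would be your existing large .bz2 file).
path = 'sample.bz2'
with bz2.open(path, 'wt', encoding='utf8') as f:
    for i in range(1000):
        f.write(f'line {i}\n')

size = os.path.getsize(path)
lines = 0
with open(path, 'rb') as raw, bz2.open(raw, 'rt', encoding='utf8') as text:
    for line in text:
        lines += 1
        # raw.tell() reports compressed bytes read so far (approximate,
        # slightly ahead due to buffering).
        percent = min(raw.tell() / size, 1.0) * 100
print(lines, f'{percent:.0f}%')
```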
With the help of the other answer, I finally found a solution. The idea is to use the size of the compressed data processed so far, the total size of the compressed file, and the elapsed time to estimate the remaining time. To achieve this:
- read the compressed file into byte_data, which is reasonably fast
- compute its size with total_size = len(byte_data)
- wrap byte_data as byte_f = io.BytesIO(byte_data)
- wrap byte_f as bz2f = bz2.BZ2File(byte_f)
- during processing, use pos = byte_f.tell() to get the current position in the compressed data
- compute the exact fraction processed as percent = pos/total_size
- record the elapsed time and use it to estimate the remaining time
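The steps above could be sketched as follows (a hedged illustration, not the exact original script; the in-memory sample data stands in for reading the real ~6 GB file from disk):

```python
import bz2
import io
import time

# Step 1: read the compressed file into memory (simulated here with
# generated sample data).
byte_data = bz2.compress(b''.join(b'line %d\n' % i for i in range(10000)))
total_size = len(byte_data)       # Step 2: total size of the compressed data

byte_f = io.BytesIO(byte_data)    # Step 3: wrap the bytes
bz2f = bz2.BZ2File(byte_f)        # Step 4: wrap as a bz2 stream

start = time.perf_counter()
lines = 0
for line in bz2f:
    lines += 1
    pos = byte_f.tell()           # Step 5: position in the compressed data
    percent = pos / total_size    # Step 6: fraction processed
    elapsed = time.perf_counter() - start
    remaining = (1 / percent - 1) * elapsed   # same estimate formula as above
print(f'{lines} lines, {percent * 100:.2f}% processed')
```

Because BZ2File reads from byte_f in chunks, byte_f.tell() advances in chunk-sized steps and runs slightly ahead of the bytes actually decompressed, which is acceptable for a progress estimate.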
After a few seconds, the estimate becomes quite accurate:
```
0.01% processed, 2.00s elapsed, 17514.27s remaining...
0.02% processed, 4.00s elapsed, 20167.48s remaining...
0.03% processed, 6.00s elapsed, 21239.60s remaining...
0.04% processed, 8.00s elapsed, 21818.91s remaining...
0.05% processed, 10.00s elapsed, 22180.76s remaining...
0.05% processed, 12.00s elapsed, 22427.78s remaining...
0.06% processed, 14.00s elapsed, 22661.80s remaining...
0.07% processed, 16.00s elapsed, 22840.45s remaining...
0.08% processed, 18.00s elapsed, 22937.07s remaining...
....
99.97% processed, 22704.28s elapsed, 6.27s remaining...
99.98% processed, 22706.28s elapsed, 4.40s remaining...
99.99% processed, 22708.28s elapsed, 2.45s remaining...
100.00% processed, 22710.28s elapsed, 0.54s remaining...
```