I need to process large bz2 files (~6 GB) using Python, decompressing them line by line with BZ2File.readline(). The problem is that I want to know how much time is needed to process the whole file.
I did a lot of searching and tried to get the actual size of the decompressed file, so that I could compute the percentage processed on the fly, and hence the time remaining. The finding is that it seems impossible to know the decompressed file size without decompressing it first ( https://stackoverflow.com/a/12647847/7876675 ). Besides taking loads of memory, decompressing the whole file first also takes a lot of time itself. So, can anybody help me get the remaining processing time on the fly?
You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same, if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)
You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.
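The estimate itself is simple proportionality: if a fraction of the compressed input has been consumed in a given elapsed time, the rest will take proportionally longer. A minimal sketch of that formula (function and variable names are my own, not from any library):

```python
import time

def eta_seconds(compressed_read, compressed_total, started_at):
    """Estimate remaining seconds, assuming compressed bytes are
    consumed at a roughly constant rate."""
    elapsed = time.time() - started_at
    return (compressed_total / compressed_read - 1) * elapsed

# Example: 25% of the input consumed after ~30 seconds -> ~90 s left.
start = time.time() - 30          # pretend we started 30 s ago
print(round(eta_seconds(250, 1000, start)))
```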
Here is a simple example that uses a BZ2Decompressor object to operate on the input a chunk at a time, showing the read progress (Python 3, getting the file name from the command line):
# Decompress a bzip2 file, showing progress based on consumed input.

import sys
import os
import bz2
import time

def proc(input):
    """Decompress and process a piece of a compressed stream."""
    dat = dec.decompress(input)
    got = len(dat)
    if got != 0:    # 0 is common -- waiting for a bzip2 block
        # process dat here
        pass
    return got

# Get the size of the compressed bzip2 file.
path = sys.argv[1]
size = os.path.getsize(path)

# Decompress CHUNK bytes at a time.
CHUNK = 16384
totin = 0
totout = 0
prev = -1
dec = bz2.BZ2Decompressor()
start = time.time()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(CHUNK), b''):
        # Feed the chunk to the decompressor.
        got = proc(chunk)

        # Handle the case of concatenated bz2 streams.
        if dec.eof:
            rem = dec.unused_data
            dec = bz2.BZ2Decompressor()
            got += proc(rem)

        # Show progress.
        totin += len(chunk)
        totout += got
        if got != 0:    # only if a bzip2 block was emitted
            frac = round(1000 * totin / size)
            if frac != prev:
                left = (size / totin - 1) * (time.time() - start)
                print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                prev = frac

# Show the resulting size.
print(end='\r')
print(totout, 'uncompressed bytes')
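Since the question asks for line-by-line processing, the `# process dat here` spot needs to split each decompressed chunk into lines while carrying any trailing partial line over to the next chunk. A hedged sketch of that buffering (the generator and its names are my own, and it omits the concatenated-stream handling shown above):

```python
import bz2

def iter_lines(path, chunk_size=16384):
    """Yield decompressed lines from a .bz2 file, feeding the
    decompressor one compressed chunk at a time."""
    dec = bz2.BZ2Decompressor()
    tail = b''                      # partial line carried between chunks
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            data = tail + dec.decompress(chunk)
            lines = data.split(b'\n')
            tail = lines.pop()      # last element is an incomplete line
            yield from lines
    if tail:                        # file did not end with a newline
        yield tail
```

Tracking progress works the same way as above: accumulate len(chunk) inside the loop and compare it against os.path.getsize(path).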
It is possible to directly use the existing high-level APIs provided by the bz2 Python module and, at the same time, obtain from the underlying file handle information on how much compressed data has been processed.
import bz2
import datetime
import io
import time

with bz2.open(input_filename, 'rt', encoding='utf8') as input_file:
    underlying_file = input_file.buffer._buffer.raw._fp
    underlying_file.seek(0, io.SEEK_END)
    underlying_file_size = underlying_file.tell()
    underlying_file.seek(0, io.SEEK_SET)
    lines_count = 0
    start_time = time.perf_counter()
    progress = f'{0:.2f}%'
    while True:
        line = input_file.readline().strip()
        if not line:
            break
        process_line(line)
        lines_count += 1
        current_position = underlying_file.tell()
        new_progress = f'{current_position / underlying_file_size * 100:.2f}%'
        if progress != new_progress:
            progress = new_progress
            current_time = time.perf_counter()
            elapsed_time = current_time - start_time
            elapsed = datetime.timedelta(seconds=elapsed_time)
            remaining = datetime.timedelta(
                seconds=(underlying_file_size / current_position - 1) * elapsed_time)
            print(f"{lines_count} lines processed, {progress}, {elapsed} elapsed, {remaining} remaining")
If you are reading binary files rather than text files, then you have to use:

with bz2.open(input_filename, 'rb') as input_file:
    underlying_file = input_file._buffer.raw._fp
    ...
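Note that `input_file.buffer._buffer.raw._fp` reaches through private attributes of CPython's io stack, which may change between versions. An alternative with the same idea (my own variation, not from the answer above) is to open the raw file yourself and pass the file object to bz2.open, which accepts one; the raw file's own tell() then reports compressed bytes consumed without touching any private attribute:

```python
import bz2
import os

def read_bz2_with_progress(path, on_progress=print):
    """Yield decompressed text lines while reporting progress computed
    from the raw file's tell(), avoiding private attributes.

    Progress is coarse: the decompressor reads the compressed stream
    ahead in chunks, so tell() jumps rather than advancing smoothly."""
    total = os.path.getsize(path)
    with open(path, 'rb') as raw:
        with bz2.open(raw, 'rt', encoding='utf8') as text:
            for line in text:
                yield line
                on_progress(f'{raw.tell() / total * 100:.2f}%')
```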
With the help of another answer, I finally found a solution. The idea is to use the amount of compressed data processed so far, the total size of the compressed file, and the elapsed time to estimate the remaining time. To achieve this:

- read the compressed file into memory as byte_data, which is quite fast
- get the total compressed size using total_size = len(byte_data)
- wrap byte_data as byte_f = io.BytesIO(byte_data)
- wrap byte_f as bz2f = bz2.BZ2File(byte_f)
- while processing, use pos = byte_f.tell() to get the current position in the compressed data
- compute the exact percentage processed with percent = pos / total_size
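Put together, the steps above can be sketched as follows (input_filename and process_line are placeholders for your own file and handler; note that this keeps the entire compressed file in memory):

```python
import bz2
import io
import time

def process_with_progress(input_filename, process_line):
    # Read the whole compressed file into memory (fast, but needs RAM
    # for the full compressed size).
    with open(input_filename, 'rb') as f:
        byte_data = f.read()
    total_size = len(byte_data)

    byte_f = io.BytesIO(byte_data)      # seekable in-memory view
    bz2f = bz2.BZ2File(byte_f)          # decompresses lazily as it is read
    start = time.time()
    for line in bz2f:
        process_line(line)
        pos = byte_f.tell()             # compressed bytes consumed so far
        percent = pos / total_size * 100
        elapsed = time.time() - start
        remaining = (total_size / pos - 1) * elapsed
        print(f'{percent:.2f}% processed, {elapsed:.2f}s elapsed, '
              f'{remaining:.2f}s remaining...', end='\r')
    print()
```

In practice you would throttle the print (e.g. only when the rounded percentage changes, as in the other answers), since printing on every line slows the loop down.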
After a few seconds, the estimation can become pretty accurate:
0.01% processed, 2.00s elapsed, 17514.27s remaining...
0.02% processed, 4.00s elapsed, 20167.48s remaining...
0.03% processed, 6.00s elapsed, 21239.60s remaining...
0.04% processed, 8.00s elapsed, 21818.91s remaining...
0.05% processed, 10.00s elapsed, 22180.76s remaining...
0.05% processed, 12.00s elapsed, 22427.78s remaining...
0.06% processed, 14.00s elapsed, 22661.80s remaining...
0.07% processed, 16.00s elapsed, 22840.45s remaining...
0.08% processed, 18.00s elapsed, 22937.07s remaining...
....
99.97% processed, 22704.28s elapsed, 6.27s remaining...
99.98% processed, 22706.28s elapsed, 4.40s remaining...
99.99% processed, 22708.28s elapsed, 2.45s remaining...
100.00% processed, 22710.28s elapsed, 0.54s remaining...