
How to get the time needed for decompressing large bz2 files?

I need to process large bz2 files (~6 GB) with Python, decompressing them line by line using BZ2File.readline() . The problem is that I want to know how much time is needed to process the whole file.

I did a lot of searching and tried to get the actual size of the decompressed file, so that I could know the percentage processed on the fly, and hence the time remaining. The finding was that it seems impossible to know the decompressed size without decompressing the file first ( https://stackoverflow.com/a/12647847/7876675 ).

Besides taking loads of memory, decompressing the file takes a lot of time by itself. So, can anybody help me get the remaining processing time on the fly?

You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output would give an accurate estimate anyway.)

You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.
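
The arithmetic is plain proportional extrapolation; as a one-line sketch (variable names are illustrative and match the example below):

# If reading totin of size compressed bytes took `elapsed` seconds so far,
# the remaining compressed bytes should take roughly:
left = (size / totin - 1) * elapsed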

Here is a simple example that uses a BZ2Decompressor object to operate on the input one chunk at a time, showing read progress (Python 3, getting the file name from the command line):

# Decompress a bzip2 file, showing progress based on consumed input.

import sys
import os
import bz2
import time

def proc(data):
    """Decompress and process a piece of a compressed stream."""
    dat = dec.decompress(data)    # dec is the global BZ2Decompressor
    got = len(dat)
    if got != 0:    # 0 is common -- still waiting for a complete bzip2 block
        # process dat here
        pass
    return got

# Get the size of the compressed bzip2 file.
path = sys.argv[1]
size = os.path.getsize(path)

# Decompress CHUNK bytes at a time.
CHUNK = 16384
totin = 0
totout = 0
prev = -1
dec = bz2.BZ2Decompressor()
start = time.time()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(CHUNK), b''):
        # feed chunk to decompressor
        got = proc(chunk)

        # handle the case of concatenated bz2 streams (more than one
        # stream may end within a single chunk, hence the loop)
        while dec.eof:
            rem = dec.unused_data
            dec = bz2.BZ2Decompressor()
            got += proc(rem)

        # show progress
        totin += len(chunk)
        totout += got
        if got != 0:    # only if a bzip2 block emitted
            frac = round(1000 * totin / size)
            if frac != prev:
                left = (size / totin - 1) * (time.time() - start)
                print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                prev = frac

# Show the resulting size.
print(end='\r')
print(totout, 'uncompressed bytes')
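
Assuming the script is saved as, say, bz2_progress.py (the name is illustrative), run it with the compressed file as its only argument:

python3 bz2_progress.py some_file.bz2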

It is also possible to use the existing high-level APIs of the bz2 module directly, and at the same time obtain from the underlying file handle how much compressed data has been processed.

import bz2
import datetime
import io    # needed for io.SEEK_END / io.SEEK_SET below
import time

# input_filename and process_line are assumed to be defined elsewhere

with bz2.open(input_filename, 'rt', encoding='utf8') as input_file:
    # Reach through the text wrapper to the raw compressed file object.
    # These are private CPython attributes and may change between versions.
    underlying_file = input_file.buffer._buffer.raw._fp
    underlying_file.seek(0, io.SEEK_END)
    underlying_file_size = underlying_file.tell()
    underlying_file.seek(0, io.SEEK_SET)
    lines_count = 0
    start_time = time.perf_counter()
    progress = f'{0:.2f}%'

    while True:
        line = input_file.readline()
        if not line:    # empty string only at end of file
            break
        line = line.strip()    # strip after the EOF check, or a blank line ends the loop early

        process_line(line)

        lines_count += 1
        current_position = underlying_file.tell()
        new_progress = f'{current_position / underlying_file_size * 100:.2f}%'
        if progress != new_progress:
            progress = new_progress
            current_time = time.perf_counter()
            elapsed_time = current_time - start_time
            elapsed = datetime.timedelta(seconds=elapsed_time)
            remaining = datetime.timedelta(seconds=(underlying_file_size / current_position - 1) * elapsed_time)
            print(f"{lines_count} lines processed, {progress}, {elapsed} elapsed, {remaining} remaining")

If you are reading binary files rather than text files, then you have to use:

with bz2.open(input_filename, 'rb') as input_file:
    underlying_file = input_file._buffer.raw._fp
    ...
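
Note that _buffer , raw and _fp are private implementation details of CPython and may change between versions. A variant that avoids them (my suggestion, not part of the answer above) is to open the file yourself and pass the file object to bz2.open, which is documented to accept one; tell() on your own handle then overshoots slightly because of read-ahead buffering, but is close enough for a progress display:

import bz2
import os

path = 'input.bz2'    # hypothetical file name
size = os.path.getsize(path)
with open(path, 'rb') as raw, bz2.open(raw, 'rt', encoding='utf8') as input_file:
    for line in input_file:
        # raw.tell() includes some read-ahead, so this slightly overestimates
        print(f'{raw.tell() / size * 100:.2f}% read', end='\r')
    print()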

With the help of another answer, I finally found a solution. The idea is to use the amount of compressed data processed so far, the total size of the compressed file, and the elapsed time to estimate the remaining time (see the sketch after this list). To achieve this,

  1. read the compressed file into memory as a bytes object, byte_data , which is quite fast
  2. calculate the size of byte_data with total_size = len(byte_data)
  3. wrap byte_data as byte_f = io.BytesIO(byte_data)
  4. wrap byte_f as bz2f = bz2.BZ2File(byte_f)
  5. during processing, use pos = byte_f.tell() to get the current position in the compressed data
  6. calculate the exact fraction processed, percent = pos/total_size
  7. record the time used, and calculate the time remaining
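
A minimal sketch of those steps (my reconstruction, not the asker's exact code; process_line stands in for the real per-line work, and the asker throttled the output differently):

import bz2
import io
import time

path = 'input.bz2'    # hypothetical file name

# 1-2. read the compressed file into memory and record its size
with open(path, 'rb') as f:
    byte_data = f.read()
total_size = len(byte_data)

byte_f = io.BytesIO(byte_data)    # 3. wrap the bytes as a seekable stream
bz2f = bz2.BZ2File(byte_f)        # 4. decompress from the in-memory stream

start = time.time()
prev = ''
for line in bz2f:
    # process_line(line)          # the real per-line work goes here
    percent = byte_f.tell() / total_size    # 5-6. exact fraction processed
    label = f'{percent * 100:.2f}'
    if label != prev:             # 7. report elapsed and remaining time
        prev = label
        elapsed = time.time() - start
        remaining = (1 / percent - 1) * elapsed
        print(f'{label}% processed, {elapsed:.2f}s elapsed, {remaining:.2f}s remaining...')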

After a few seconds, the estimation can become pretty accurate:

0.01% processed, 2.00s elapsed, 17514.27s remaining...
0.02% processed, 4.00s elapsed, 20167.48s remaining...
0.03% processed, 6.00s elapsed, 21239.60s remaining...
0.04% processed, 8.00s elapsed, 21818.91s remaining...
0.05% processed, 10.00s elapsed, 22180.76s remaining...
0.05% processed, 12.00s elapsed, 22427.78s remaining...
0.06% processed, 14.00s elapsed, 22661.80s remaining...
0.07% processed, 16.00s elapsed, 22840.45s remaining...
0.08% processed, 18.00s elapsed, 22937.07s remaining...
....
99.97% processed, 22704.28s elapsed, 6.27s remaining...
99.98% processed, 22706.28s elapsed, 4.40s remaining...
99.99% processed, 22708.28s elapsed, 2.45s remaining...
100.00% processed, 22710.28s elapsed, 0.54s remaining...
