How to get the time needed for decompressing large bz2 files?

I need to process large bz2 files (~6 GB) in Python, decompressing them line by line with BZ2File.readline(). The problem is that I want to know how much time processing the whole file will take.

I searched a lot and tried to get the actual size of the decompressed file, so that I could compute the percentage processed on the fly and hence the time remaining. However, it seems impossible to know the decompressed size without decompressing the file first ( https://stackoverflow.com/a/12647847/7876675 ).

Decompressing the whole file up front takes a lot of memory, and the decompression itself takes a lot of time. So, can anybody help me get the remaining processing time on the fly?

You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)

You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.
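As a minimal sketch of that arithmetic (the function name `estimate_remaining` and its parameters are illustrative and not part of any library), the remaining time can be extrapolated linearly from the compressed bytes consumed so far:

```python
def estimate_remaining(consumed, total, elapsed):
    """Linearly extrapolate the remaining seconds from compressed bytes consumed.

    Assumes the compressed data is roughly homogeneous, so that each
    compressed byte takes about the same time to process.
    """
    if consumed <= 0:
        return None  # no progress yet, so no estimate is possible
    return (total / consumed - 1) * elapsed
```

For instance, if a quarter of the compressed file has been consumed in 10 seconds, the sketch projects three times that long for the rest.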

Here is a simple example that uses a BZ2Decompressor object to operate on the input a chunk at a time, showing the read progress (Python 3, getting the file name from the command line):

# Decompress a bzip2 file, showing progress based on consumed input.

import sys
import os
import bz2
import time

def proc(input):
    """Decompress and process a piece of a compressed stream"""
    dat = dec.decompress(input)
    got = len(dat)
    if got != 0:    # 0 is common -- waiting for a bzip2 block
        # process dat here
        pass
    return got

# Get the size of the compressed bzip2 file.
path = sys.argv[1]
size = os.path.getsize(path)

# Decompress CHUNK bytes at a time.
CHUNK = 16384
totin = 0
totout = 0
prev = -1
dec = bz2.BZ2Decompressor()
start = time.time()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(CHUNK), b''):
        # feed chunk to decompressor
        got = proc(chunk)

        # handle case of concatenated bz2 streams
        if dec.eof:
            rem = dec.unused_data
            dec = bz2.BZ2Decompressor()
            got += proc(rem)

        # show progress
        totin += len(chunk)
        totout += got
        if got != 0:    # only if a bzip2 block emitted
            frac = round(1000 * totin / size)
            if frac != prev:
                left = (size / totin - 1) * (time.time() - start)
                print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='')
                prev = frac

# Show the resulting size.
print(end='\r')
print(totout, 'uncompressed bytes')

It is possible to use the existing high-level APIs of the bz2 Python module directly and, at the same time, obtain from the underlying file object how much compressed data has been processed.

import bz2
import datetime
import io
import time

with bz2.open(input_filename, 'rt', encoding='utf8') as input_file:
    underlying_file = input_file.buffer._buffer.raw._fp
    underlying_file.seek(0, io.SEEK_END)
    underlying_file_size = underlying_file.tell()
    underlying_file.seek(0, io.SEEK_SET)
    lines_count = 0
    start_time = time.perf_counter()
    progress = f'{0:.2f}%'

    while True:
        line = input_file.readline()
        if not line:    # '' means end of file; a blank line is '\n'
            break
        line = line.strip()

        process_line(line)

        lines_count += 1
        current_position = underlying_file.tell()
        new_progress = f'{current_position / underlying_file_size * 100:.2f}%'
        if progress != new_progress:
            progress = new_progress
            current_time = time.perf_counter()
            elapsed_time = current_time - start_time
            elapsed = datetime.timedelta(seconds=elapsed_time)
            remaining = datetime.timedelta(seconds=(underlying_file_size / current_position - 1) * elapsed_time)
            print(f"{lines_count} lines processed, {progress}, {elapsed} elapsed, {remaining} remaining")

If you are reading binary files rather than text files, use:

with bz2.open(input_filename, 'rb') as input_file:
    underlying_file = input_file._buffer.raw._fp
    ...
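The `_buffer`, `raw`, and `_fp` attributes used above are private names, so they may differ between Python versions. A possible alternative, sketched here under the assumption that you can open the compressed file yourself, is to pass your own file object to `bz2.open` and call `tell()` on it. The helper name `iter_lines_with_progress` is hypothetical:

```python
import bz2
import os

def iter_lines_with_progress(path, encoding='utf8'):
    """Yield (line, fraction) pairs for a bzip2-compressed text file.

    fraction is the share of compressed bytes read from disk so far.
    It advances in steps, because bz2 reads the raw file in buffered chunks.
    """
    size = os.path.getsize(path)
    # Open the raw file ourselves so that we keep a public handle to it.
    with open(path, 'rb') as raw, bz2.open(raw, 'rt', encoding=encoding) as text:
        for line in text:
            yield line, raw.tell() / size
```

The fraction can then be fed into the same elapsed-time extrapolation as in the loop above.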

With the help of another answer, I finally found a solution. The idea is to use the amount of compressed data processed, the total size of the compressed file, and the time elapsed to estimate the remaining time. To achieve this:

  1. read the compressed file into memory as a bytes object, byte_data, which is quite fast
  2. calculate the size of byte_data using total_size = len(byte_data)
  3. wrap byte_data as byte_f = io.BytesIO(byte_data)
  4. wrap byte_f as bz2f = bz2.BZ2File(byte_f)
  5. during processing, use pos = byte_f.tell() to get the current position in the compressed data
  6. calculate the fraction processed as percent = pos / total_size
  7. record the time used, and calculate the time remaining
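The steps above can be sketched roughly as follows. This is an illustration rather than the exact code that was used: the names `process_with_progress`, `handle_line`, and `report_every` are made up for the example, and step 1 (reading the file into `byte_data`) is left to the caller.

```python
import bz2
import io
import time

def process_with_progress(byte_data, handle_line, report_every=2.0):
    """Decompress byte_data line by line, printing progress periodically."""
    total_size = len(byte_data)                 # step 2
    byte_f = io.BytesIO(byte_data)              # step 3
    bz2f = bz2.BZ2File(byte_f)                  # step 4
    start = last_report = time.perf_counter()
    lines = 0
    for line in bz2f:                           # each line is a bytes object
        handle_line(line)
        lines += 1
        now = time.perf_counter()
        if now - last_report >= report_every:
            pos = byte_f.tell()                 # step 5
            percent = pos / total_size          # step 6
            elapsed = now - start               # step 7
            remaining = (1 / percent - 1) * elapsed
            print(f'{percent:.2%} processed, {elapsed:.2f}s elapsed, '
                  f'{remaining:.2f}s remaining...')
            last_report = now
    return lines
```

A caller would read the file first, for example with `open(path, 'rb').read()`, and pass the result in as `byte_data`. Note that this keeps the entire compressed file in memory.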

After a few seconds, the estimate can become pretty accurate:

0.01% processed, 2.00s elapsed, 17514.27s remaining...
0.02% processed, 4.00s elapsed, 20167.48s remaining...
0.03% processed, 6.00s elapsed, 21239.60s remaining...
0.04% processed, 8.00s elapsed, 21818.91s remaining...
0.05% processed, 10.00s elapsed, 22180.76s remaining...
0.05% processed, 12.00s elapsed, 22427.78s remaining...
0.06% processed, 14.00s elapsed, 22661.80s remaining...
0.07% processed, 16.00s elapsed, 22840.45s remaining...
0.08% processed, 18.00s elapsed, 22937.07s remaining...
....
99.97% processed, 22704.28s elapsed, 6.27s remaining...
99.98% processed, 22706.28s elapsed, 4.40s remaining...
99.99% processed, 22708.28s elapsed, 2.45s remaining...
100.00% processed, 22710.28s elapsed, 0.54s remaining...
