
Boost 1.59 not decompressing all bzip2 streams

I've been trying to decompress some .bz2 files on the fly, line by line, because the files I'm dealing with are massive when uncompressed (in the region of 100 GB), and I wanted a solution that saves disk space.

I have no problems decompressing files compressed with vanilla bzip2, but for files compressed with pbzip2 only the first bz2 stream is decompressed. This bug tracker entry relates to the problem: https://svn.boost.org/trac/boost/ticket/3853 but I was led to believe it was fixed after version 1.41. I've checked the bzip2.hpp file and it contains the 'fixed' version, and I've also checked that the version of Boost used in the program is 1.59.

The code is here:

cout<<"Warning bzip2 support is a little buggy!"<<endl;

//Open the file here
trans_file.open(files[i].c_str(), std::ios_base::in |  std::ios_base::binary);

//Set up boost bzip2 decompression
boost::iostreams::filtering_istream in;
in.push(boost::iostreams::bzip2_decompressor());
in.push(trans_file);
std::string str;

//Begin reading
while(std::getline(in, str))
{
    std::stringstream stream(str);
    stream>>id_f>>id_i>>aif;
    /* Do stuff with values here*/
}

Any suggestions would be great. Thanks!

You are right.

It seems that changeset #63057 only fixes part of the issue.

The corresponding unit test does pass, though. However, it uses the copy algorithm (and also a composite<> instead of a filtering_istream, if that is relevant).

I'd open this as a defect or a regression. Include a file that exhibits the problem, of course. I can reproduce it using just /etc/dictionaries-common/words compressed with pbzip2 (default options).
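Before attaching a file to the report, it can help to confirm that it really contains more than one bzip2 stream. The following is only a rough sketch, not anything from Boost or libbz2: the helper names are made up for illustration, and the check is a heuristic. It counts byte-aligned positions where the stream header "BZh" plus a level digit '1' to '9' is followed directly by the block magic 0x314159265359 (or by the end-of-stream magic 0x177245385090, as in an empty stream). pbzip2 concatenates whole streams, so their headers are byte-aligned, but compressed data could in principle contain the same ten bytes by chance.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost or libbz2): counts byte-aligned
// positions in buf that look like the start of a bzip2 stream.
std::size_t count_bzip2_stream_headers(const std::vector<unsigned char>& buf)
{
    // Magic that opens a compressed block.
    static const unsigned char block_magic[6] = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    // End-of-stream magic; it follows the header directly in an empty stream.
    static const unsigned char eos_magic[6] = {0x17, 0x72, 0x45, 0x38, 0x50, 0x90};

    std::size_t count = 0;
    for (std::size_t i = 0; i + 10 <= buf.size(); ++i) {
        // Stream header: 'B' 'Z' 'h' followed by the block-size digit '1'..'9'.
        if (buf[i] != 'B' || buf[i + 1] != 'Z' || buf[i + 2] != 'h')
            continue;
        if (buf[i + 3] < '1' || buf[i + 3] > '9')
            continue;
        // The six bytes after the header must be one of the two magics.
        if (std::equal(block_magic, block_magic + 6, buf.begin() + i + 4) ||
            std::equal(eos_magic, eos_magic + 6, buf.begin() + i + 4))
            ++count;
    }
    return count;
}

// Hypothetical convenience wrapper: loads the whole file into memory, so it
// is only meant for a small reproduction file, not for a 100 GB archive.
std::size_t count_bzip2_stream_headers_in_file(const std::string& path)
{
    std::ifstream f(path.c_str(), std::ios::binary);
    std::vector<unsigned char> buf((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    return count_bzip2_stream_headers(buf);
}
```

Calling count_bzip2_stream_headers_in_file("test.bz2") and getting a result greater than 1 would suggest a multi-stream file, which is the kind of input that triggers the behaviour described above. A file written by plain bzip2 in a single run should normally give 1.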

I have the test.bz2 here: http://7f0d2fd2-af79-415c-ab60-033d3b494dc9.s3.amazonaws.com/test.bz2

Here's my test program:

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/bzip2.hpp>
#include <boost/iostreams/stream.hpp>
#include <fstream>
#include <iostream>
#include <string>

namespace io = boost::iostreams;

void multiple_member_test(); // from the unit tests in changeset #63057

int main() {
    //multiple_member_test();
    //return 0;

    std::ifstream trans_file("test.bz2", std::ios::binary);

    //Set up boost bzip2 decompression
    io::filtering_istream in;
    in.push(io::bzip2_decompressor());
    in.push(trans_file);

    //Begin reading
    std::string str;
    while(std::getline(in, str))
    {
        std::cout << str << "\n";
    }
}

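#include <boost/iostreams/compose.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/range/iterator_range.hpp> // boost::make_iterator_range
#include <algorithm> // std::equal
#include <cassert>
#include <sstream>
#include <vector>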

void multiple_member_test()  // from the unit tests in changeset #63057
{ 
    std::string      data(20ul << 20, '*');
    std::vector<char>  temp, dest; 

    // Write compressed data to temp, twice in succession 
    io::filtering_ostream out; 
    out.push(io::bzip2_compressor()); 
    out.push(io::back_inserter(temp)); 
    io::copy(boost::make_iterator_range(data), out); 
    out.push(io::back_inserter(temp)); 
    io::copy(boost::make_iterator_range(data), out); 

    // Read compressed data from temp into dest 
    io::filtering_istream in; 
    in.push(io::bzip2_decompressor()); 
    in.push(io::array_source(&temp[0], temp.size())); 
    io::copy(in, io::back_inserter(dest)); 

    // Check that dest consists of two copies of data 
    assert(data.size() * 2 == dest.size()); 
    assert(std::equal(data.begin(), data.end(), dest.begin())); 
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2)); 

    dest.clear(); 
    io::copy( 
            io::array_source(&temp[0], temp.size()), 
            io::compose(io::bzip2_decompressor(), io::back_inserter(dest))); 

    // Check that dest consists of two copies of data 
    assert(data.size() * 2 == dest.size()); 
    assert(std::equal(data.begin(), data.end(), dest.begin())); 
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2)); 
} 
