
alternative to using std::getline for boost::iostreams::filtering_istream

I have a binary file compressed with gzip which I wish to stream using boost::iostreams. After searching the web for the past few hours, I found a code snippet that does almost what I want, except for the use of std::getline:

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    try
    {
        std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);
        boost::iostreams::filtering_istream in;
        in.push(boost::iostreams::gzip_decompressor());
        in.push(file);
        std::vector<std::byte> buffer;
        for (std::string str; std::getline(in, str); )
        {
            std::cout << "str length: " << str.length() << '\n';
            for (auto c : str)
                buffer.push_back(std::byte(c));
            std::cout << "buffer size: " << buffer.size() << '\n';
            // process buffer
            // ...
        }
    }
    catch (const boost::iostreams::gzip_error& e)
    {
        std::cout << e.what() << '\n';
    }
}

I want to read the file into an intermediary buffer, filling the buffer as I stream the file. However, std::getline splits on the \n delimiter and does not include the delimiter in the output string, so newline bytes are silently dropped from binary data.

Is there a way I could read, for instance, 2048 bytes of data at a time?

Uncompressing the gzip stream the way you want isn't exactly straightforward. One option is using boost::iostreams::copy to uncompress the whole gzip stream into the vector, but since you want to decompress the stream in chunks (2k, as mentioned in your post) that may not be an option.
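For reference, a minimal sketch of that one-shot approach (same setup as your snippet; io::copy pulls the filtered stream to the end in a single call):

#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <fstream>
#include <vector>

int main()
{
    namespace io = boost::iostreams;
    std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(file);

    // Pull the entire decompressed stream into the vector in one call
    std::vector<char> decompressed;
    io::copy(in, io::back_inserter(decompressed));
}

This is the simplest route when the whole decompressed file fits comfortably in memory, but it gives you no chunk-by-chunk control.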

Now, normally with an input stream it's as simple as calling the read() function on the stream, specifying the buffer and the number of bytes to read, and then calling gcount() to determine how many bytes were actually read. Unfortunately there seems to be a bug in either filtering_istream or gzip_decompressor (or possibly gcount() is simply not supported, though it should be): it always returns the number of bytes requested rather than the number actually read. As you might imagine, this causes problems when reading the last few bytes of the file unless you know ahead of time how many bytes to expect.
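For contrast, this is the usual chunked-read idiom on a plain std::istream (readAllChunks is just an illustrative name); it is exactly the gcount() behaviour described above that breaks this pattern for the filtered stream:

#include <istream>
#include <vector>

// The usual chunked-read idiom: read() asks for a fixed number of bytes,
// gcount() reports how many actually arrived on the final short read.
void readAllChunks(std::istream& stream, std::vector<char>& out)
{
    char chunk[2048];
    while (stream.read(chunk, sizeof(chunk)) || stream.gcount() > 0)
        out.insert(out.end(), chunk, chunk + stream.gcount());
}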

Fortunately, the size of the uncompressed data is stored in the last four bytes of the gzip file (the ISIZE trailer field, modulo 2^32, so this is only reliable for files under 4 GiB), which means we can account for that; we just have to work a little harder in the decompression loop.

Below is the code I came up with to handle uncompressing the stream the way you would like. It creates two vectors: one for decompressing each 2k chunk, and one for the final buffer. It's quite basic and I haven't done anything to optimize memory usage on the vectors, but if that's an issue I suggest switching to a single vector, resizing it to the length of the uncompressed data, and passing read an offset into the vector's data for the 2k chunk being read (see the sketch after the code).

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    namespace io = boost::iostreams;

    std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);

    // Get the uncompressed size from the gzip trailer (the ISIZE field,
    // stored little endian per RFC 1952; this raw read assumes a
    // little-endian host)
    std::uint32_t dataLeft;
    file.seekg(-4, std::ios_base::end);
    file.read(reinterpret_cast<char*>(&dataLeft), sizeof(dataLeft));
    file.seekg(0);

    // Set up the gzip stream
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(file);

    std::vector<std::byte> buffer, tmp(2048);
    for (auto toRead(std::min<std::size_t>(tmp.size(), dataLeft));
        dataLeft && in.read(reinterpret_cast<char*>(tmp.data()), toRead);
        dataLeft -= toRead, toRead = std::min<std::size_t>(tmp.size(), dataLeft))
    {
        // Trim tmp to the bytes actually read (only shrinks on the final chunk)
        tmp.resize(toRead);
        buffer.insert(buffer.end(), tmp.begin(), tmp.end());
        std::cout << "buffer size: " << buffer.size() << '\n';
    }
}
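
If the per-chunk copying becomes an issue, a minimal sketch of the single-vector variant suggested above could look like this (readDecompressed is an illustrative name, not part of the code above; pass it the filtering stream and the size read from the trailer):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <vector>

// Single-buffer variant: size the output up front from the gzip trailer,
// then read each 2k chunk directly into place -- no intermediate vector.
// (A sketch of the optimization described above, under the same
// known-size assumption.)
std::vector<std::byte> readDecompressed(std::istream& in, std::uint32_t totalSize)
{
    std::vector<std::byte> buffer(totalSize);
    for (std::size_t offset = 0; offset < buffer.size(); )
    {
        std::size_t toRead = std::min<std::size_t>(2048, buffer.size() - offset);
        if (!in.read(reinterpret_cast<char*>(buffer.data() + offset), toRead))
            break;  // stream ended early; the remaining bytes stay zero
        offset += toRead;
    }
    return buffer;
}

Called as readDecompressed(in, dataLeft) in place of the loop above, this avoids the tmp vector and the insert copy entirely.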
