简体   繁体   中英

how to use boost::iostreams::mapped_file_source with a gzipped input file

I am using boost::iostreams::mapped_file_source to read a text file from a specific position to a specific position and to manipulate each line (compiled using g++ -Wall -O3 -lboost_iostreams -o test main.cpp):

#include <iostream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

int main() {
    boost::iostreams::mapped_file_source f_read;
    f_read.open("in.txt");

    long long int alignment_offset(0);

    // set the start point
    const char* pt_current(f_read.data() + alignment_offset);
    // set the end point
    const char* pt_last(f_read.data() + f_read.size());
    const char* pt_current_line_start(pt_current);

    std::string buffer;

    while (pt_current && (pt_current != pt_last)) {
        if ((pt_current = static_cast<const char*>(memchr(pt_current, '\n', pt_last - pt_current)))) {
            buffer.assign(pt_current_line_start, pt_current - pt_current_line_start + 1);
            // do something with buffer

            pt_current++;
            pt_current_line_start = pt_current;
        }
    }

    return 0;
}

Currently, I would like to make this code handle gzip files as well and modify the code like this:

#include<iostream>
#include<boost/iostreams/device/mapped_file.hpp>
#include<boost/iostreams/filter/gzip.hpp>
#include<boost/iostreams/filtering_streambuf.hpp>
#include<boost/iostreams/filtering_stream.hpp>
#include<boost/iostreams/stream.hpp>

int main() {
    boost::iostreams::stream<boost::iostreams::mapped_file_source> file;
    file.open(boost::iostreams::mapped_file_source("in.txt.gz"));

    boost::iostreams::filtering_streambuf< boost::iostreams::input > in; 
    in.push(boost::iostreams::gzip_decompressor());
    in.push(file);

    std::istream std_str(&in);
    std::string buffer;
    while(1) {
        std::getline(std_str, buffer);
        if (std_str.eof()) break;
        // do something with buffer
    }   
}   

This code also work well but I don't know how can set the start point (pt_current) and the end point (pt_last) like the first code. Could you let me know how I can set the two values in the second code?

The answer is no, that's not possible. The compressed stream would need to have indexes.


The real question is Why? . You are using a memory mapped file. Doing on-the-fly compression/decompression is only going to reduce performance and increase memory consumption.

If you're not short on actual file storage, then you should probably consider a binary representation, or keep the text as it is.

Binary representation could sidestep most of the complexity involved when using text files with random access.

Some inspirational samples:


What you're basically discovering is that text files aren't random access, and compression makes indexing essentially fuzzy (there is no precise mapping from compressed stream offset to uncompressed stream offset).

Look at the zran.c example in the zlib distribution as mentioned in the zlib FAQ :

28. Can I access data randomly in a compressed stream?

No, not without some preparation. If when compressing you periodically use Z_FULL_FLUSH , carefully write all the pending data at those points, and keep an index of those locations, then you can start decompression at those points. You have to be careful to not use Z_FULL_FLUSH too often, since it can significantly degrade compression. Alternatively, you can scan a deflate stream once to generate an index, and then use that index for random access. See examples/zran.c

¹ you could specifically look at parallel implementations such as eg pbzip2 or pigz; These will necessarily use these "chunks" or "frames" to schedule the load across cores

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM