How to use boost::iostreams::mapped_file_source with a gzipped input file

I am using boost::iostreams::mapped_file_source to read a text file from one specific position to another and to manipulate each line (compiled using g++ -Wall -O3 -lboost_iostreams -o test main.cpp):

#include <cstring>   // std::memchr
#include <iostream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

int main() {
    boost::iostreams::mapped_file_source f_read;
    f_read.open("in.txt");

    long long int alignment_offset(0);

    // set the start point
    const char* pt_current(f_read.data() + alignment_offset);
    // set the end point
    const char* pt_last(f_read.data() + f_read.size());
    const char* pt_current_line_start(pt_current);

    std::string buffer;

    // pt_current becomes null once no further '\n' is found, which ends the loop
    while (pt_current && (pt_current != pt_last)) {
        if ((pt_current = static_cast<const char*>(std::memchr(pt_current, '\n', pt_last - pt_current)))) {
            // copy the line, including its trailing '\n'
            buffer.assign(pt_current_line_start, pt_current - pt_current_line_start + 1);
            // do something with buffer

            pt_current++;
            pt_current_line_start = pt_current;
        }
    }

    return 0;
}

Now I would like this code to handle gzip files as well, so I modified it like this:

#include <iostream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/stream.hpp>

int main() {
    boost::iostreams::stream<boost::iostreams::mapped_file_source> file;
    file.open(boost::iostreams::mapped_file_source("in.txt.gz"));

    boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
    in.push(boost::iostreams::gzip_decompressor());
    in.push(file);

    std::istream std_str(&in);
    std::string buffer;
    // testing getline's result also handles a final line without a trailing '\n'
    while (std::getline(std_str, buffer)) {
        // do something with buffer
    }

    return 0;
}

This code also works well, but I don't know how to set the start point (pt_current) and the end point (pt_last) as in the first code. Could you let me know how I can set these two values in the second code?

The answer is no, that's not possible. The compressed stream would need to have indexes.


The real question is: why? You are using a memory-mapped file. Doing on-the-fly compression/decompression is only going to reduce performance and increase memory consumption.

If you're not short on actual file storage, then you should probably consider a binary representation, or keep the text as it is.

A binary representation could sidestep most of the complexity involved in using text files with random access.
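As a rough illustration of why a binary layout helps: with fixed-size records, the position of record i is simply i * sizeof(Record), so no scanning for newlines is needed. The Record struct and read_record function below are hypothetical, and the sketch assumes the file was written on a machine with the same endianness and struct layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size record; the fields are illustrative only.
struct Record {
    std::uint32_t id;
    std::uint32_t value;
};

// Copy record `index` out of a raw byte buffer (for example the data()
// pointer of a mapped file). Returns false if the index is out of range.
bool read_record(const char* data, std::size_t size,
                 std::size_t index, Record& out) {
    if (index >= size / sizeof(Record)) {
        return false;
    }
    // memcpy avoids alignment problems when reading from a byte buffer.
    std::memcpy(&out, data + index * sizeof(Record), sizeof(Record));
    return true;
}
```

With a mapped file this could be called as read_record(f_read.data(), f_read.size(), i, rec), which gives constant-time access to any record.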

Some inspirational samples:


What you're basically discovering is that text files aren't random access, and compression makes indexing essentially fuzzy (there is no precise mapping from compressed stream offset to uncompressed stream offset).
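For uncompressed text, the usual workaround is to scan the data once and remember where each line starts. The following is a minimal sketch of such a line index over an in-memory buffer (for instance the data() and size() of a mapped file). build_line_index and get_line are hypothetical names, not library functions.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: record the byte offset at which every line starts.
std::vector<std::size_t> build_line_index(const char* data, std::size_t size) {
    std::vector<std::size_t> starts;
    std::size_t pos = 0;
    while (pos < size) {
        starts.push_back(pos);
        const void* nl = std::memchr(data + pos, '\n', size - pos);
        if (!nl) {
            break;  // last line has no trailing newline
        }
        pos = static_cast<std::size_t>(static_cast<const char*>(nl) - data) + 1;
    }
    return starts;
}

// Hypothetical helper: return line `n` without its trailing newline.
// Throws std::out_of_range (from vector::at) if `n` is not a valid line.
std::string get_line(const char* data, std::size_t size,
                     const std::vector<std::size_t>& starts, std::size_t n) {
    const std::size_t begin = starts.at(n);
    std::size_t end;
    if (n + 1 < starts.size()) {
        end = starts[n + 1] - 1;  // position of this line's '\n'
    } else {
        end = size;
        if (end > begin && data[end - 1] == '\n') {
            --end;  // drop the final newline of the file
        }
    }
    return std::string(data + begin, end - begin);
}
```

The index costs one full pass over the data, after which any line can be fetched directly. Nothing comparable works on the gzipped bytes, because line boundaries are only visible after decompression.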

Look at the zran.c example in the zlib distribution, as mentioned in the zlib FAQ:

28. Can I access data randomly in a compressed stream?

No, not without some preparation. If when compressing you periodically use Z_FULL_FLUSH, carefully write all the pending data at those points, and keep an index of those locations, then you can start decompression at those points. You have to be careful to not use Z_FULL_FLUSH too often, since it can significantly degrade compression. Alternatively, you can scan a deflate stream once to generate an index, and then use that index for random access. See examples/zran.c

¹ You could specifically look at parallel implementations such as pbzip2 or pigz; these will necessarily use such "chunks" or "frames" to schedule the load across cores.
