Reliably using C++ Small String Optimization to fread short std::strings from Files into Memory

I have the following class. It contains a data structure called Index, which is expensive to compute, so I cache the index to disk and read it back in. The index element id, of template type T, can be used with a variety of primitive data types.

But I would also like to use id with the type std::string. I wrote the serialize/deserialize code for the general case and tested it with normal C++ strings: it works as long as the strings are short enough. Small string optimization seems to kick in.

I also wrote a different implementation just for handling longer strings safely. But the safe code is about 10x slower, and I would really like to just read in the strings with fread (a 500 ms read-in is very painful, while 50 ms is perfectly fine).

How can I reliably use my libcpp small string optimization if I know that all identifiers are shorter than the longest possible short string? And how can I reliably tell how long the longest possible small string is?

template<typename T>
class Reader {
public:
    struct Index {
        T id;
        size_t length;
        // ... values etc
    };

    Index* index;
    size_t indexTableSize;

    void serialize(const char* fileName) {
        FILE *file = fopen(fileName, "w+b");
        if (file == NULL)
            return;

        fwrite(&indexTableSize, sizeof(size_t), 1, file);
        fwrite(index, sizeof(Index), indexTableSize, file);

        fclose(file);
    }

    void deserialize(const char* fileName) {
        FILE *file = fopen(fileName, "rb");
        if (file == NULL)
            return;

        fread(&indexTableSize, sizeof(size_t), 1, file);
        index = new Index[indexTableSize];
        fread(index, sizeof(Index), indexTableSize, file);

        fclose(file);
    }


};

// works perfectly fine
template class Reader<int32_t>;

// works perfectly fine for strings shorter than 22 bytes
template class Reader<std::string>;

std::string is not trivially copyable. Performing memcpy on a type (which is the equivalent of fwriting it and freading it back) is only legal in C++ if the type is trivially copyable. Therefore, what you want to do is not possible directly. A long string keeps its characters in separately allocated memory, so the raw bytes of the object contain only a pointer that means nothing when read back later; that short strings happen to survive the round trip is an implementation detail, not a guarantee.

If you want to serialize a string, you must do so manually. Get the number of characters and write it, then write the characters themselves. To read the string back in, read its size, then read that many characters.
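The manual approach described above can be sketched roughly as follows. The function names `write_string` and `read_string` and the choice of a 64-bit length prefix are assumptions made for illustration; they are not part of the original code, and the sketch does not guard against a corrupted length field.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: writes a 64-bit length followed by the raw characters.
// Returns false if any write comes up short.
bool write_string(std::FILE* file, const std::string& s) {
    const std::uint64_t size = s.size();
    if (std::fwrite(&size, sizeof size, 1, file) != 1)
        return false;
    // For an empty string this writes zero bytes and compares 0 == 0.
    return std::fwrite(s.data(), 1, s.size(), file) == s.size();
}

// Hypothetical helper: reads back a string produced by write_string.
// Returns false on a short read.
bool read_string(std::FILE* file, std::string& out) {
    std::uint64_t size = 0;
    if (std::fread(&size, sizeof size, 1, file) != 1)
        return false;
    out.resize(static_cast<std::size_t>(size));
    if (size == 0)
        return true;
    // The string owns out.size() contiguous characters starting at &out[0].
    return std::fread(&out[0], 1, out.size(), file) == out.size();
}
```

Because the string manages its own storage in `resize`, this works for any length, independent of whether the implementation stores the characters inline.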

If you want to reliably serialize/deserialize a type T this way, you have to make sure that T is a POD type (or, more precisely, standard layout and trivial).

You can check this in your template by using std::is_trivially_copyable<T> and std::is_standard_layout<T>. Unfortunately, this check will fail for std::string.
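One way to apply that check is a compile-time guard in front of the raw write. This is a minimal sketch assuming C++14 or later; `is_raw_serializable` and `write_raw` are hypothetical names introduced here, not part of the original class.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <type_traits>

// Hypothetical trait: true when T can be written and read back as raw bytes.
template <typename T>
constexpr bool is_raw_serializable =
    std::is_trivially_copyable<T>::value && std::is_standard_layout<T>::value;

// Hypothetical helper: writes `count` objects as raw bytes, but refuses to
// compile for types such as std::string that fail the check.
template <typename T>
bool write_raw(std::FILE* file, const T* items, std::size_t count) {
    static_assert(is_raw_serializable<T>,
                  "T cannot be written as raw bytes; serialize its state explicitly");
    return std::fwrite(items, sizeof(T), count, file) == count;
}
```

With this guard, calling `write_raw` on an array of `int32_t` compiles, while calling it on an array of `std::string` stops at the `static_assert` instead of silently writing pointers to disk.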

If that is not the case, you must find a proper way to serialize/deserialize the class, i.e. write/read the data that allows the state of the object to be reconstructed (here, the length of the string and its content).

Three options:

  • use an auxiliary template that converts T from/to an array of bytes, and write a specialisation of this template for each type that may be used with your Reader.
  • use a member function that does this. But this is not possible for std types.
  • use a serialization library, for example Boost.Serialization, s11n or others.
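The first option could look roughly like the sketch below. `Serializer`, `write_entry` and `read_entry` are hypothetical names, and the 64-bit length field is an assumption; the point is only that the primary template covers trivially copyable types while a specialisation handles std::string explicitly.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <type_traits>

// Primary template (hypothetical): raw byte copy, allowed only for
// trivially copyable types.
template <typename T>
struct Serializer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "add a Serializer specialisation for this type");
    static bool write(std::FILE* f, const T& value) {
        return std::fwrite(&value, sizeof value, 1, f) == 1;
    }
    static bool read(std::FILE* f, T& value) {
        return std::fread(&value, sizeof value, 1, f) == 1;
    }
};

// Specialisation for std::string: length prefix, then the characters.
template <>
struct Serializer<std::string> {
    static bool write(std::FILE* f, const std::string& s) {
        const std::uint64_t size = s.size();
        return std::fwrite(&size, sizeof size, 1, f) == 1 &&
               std::fwrite(s.data(), 1, s.size(), f) == s.size();
    }
    static bool read(std::FILE* f, std::string& s) {
        std::uint64_t size = 0;
        if (std::fread(&size, sizeof size, 1, f) != 1)
            return false;
        s.resize(static_cast<std::size_t>(size));
        return size == 0 || std::fread(&s[0], 1, s.size(), f) == s.size();
    }
};

// An index entry is then written field by field instead of as one raw struct.
template <typename T>
bool write_entry(std::FILE* f, const T& id, std::uint64_t length) {
    return Serializer<T>::write(f, id) &&
           Serializer<std::uint64_t>::write(f, length);
}

template <typename T>
bool read_entry(std::FILE* f, T& id, std::uint64_t& length) {
    return Serializer<T>::read(f, id) &&
           Serializer<std::uint64_t>::read(f, length);
}
```

A Reader built on this would loop over its index and call `write_entry` for each element, so the same template code serves both `Reader<int32_t>` and `Reader<std::string>`.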

In any case, I would strongly advise you not to rely on non-portable properties such as the length of short strings, especially when the code lives in a template that is supposed to work with generic types.
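To see why the short-string limit is not something to build on, it can be probed at run time. The sketch below is only a diagnostic under stated assumptions: `sso_capacity_guess` and `stored_inline` are hypothetical helpers, the standard does not promise that an empty string's capacity equals the inline buffer size, and the pointer comparison is an implementation-level check rather than portable behaviour.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Capacity of an empty string. On common implementations this matches the
// inline (small string) buffer size, but nothing in the standard requires it.
inline std::size_t sso_capacity_guess() {
    return std::string().capacity();
}

// Reports whether the character buffer lies inside the string object itself,
// i.e. whether the small string optimization is in effect for this value.
inline bool stored_inline(const std::string& s) {
    const auto begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= begin && data < begin + sizeof(std::string);
}
```

The 22-byte limit mentioned in the question is what libc++ commonly provides on 64-bit targets; other standard libraries use different limits (libstdc++ and MSVC commonly allow 15 characters), so a file format that depends on the value silently breaks when the toolchain changes.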

