Boost Spirit QI slow

Question

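I am trying to parse TPC-H files with Boost Spirit QI. My implementation is inspired by the employee example from Spirit QI ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ). The data is in CSV format, with tokens delimited by a '|' character.

It works, but it is very slow (20 seconds for 1 GB).

Here is my qi grammar for the lineitem file: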

struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment)) 

vector<lineitem>* lineitems=new vector<lineitem>();

phrase_parse(state->dataPointer,
    state->dataEndPointer,
    (*(int_ >> "|" >>
    int_ >> "|" >> 
    int_ >> "|" >>
    int_ >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' 
    ) ), space, *lineitems
);

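The problem seems to be the character parsing. It is much slower than the other conversions. Is there a better way to parse variable-length tokens into strings?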

Answer 1

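I found a solution to my problem. As described in the post "Boost Spirit QI grammar slow for parsing delimited strings", the performance bottleneck is Spirit qi's string handling. All the other data types seem to be fast.

I avoid this problem by handling the data myself instead of relying on Spirit qi's attribute handling.

My solution uses a helper class that provides a function for every field of the csv file. The functions store the values into a struct. Strings are stored in char[] arrays. When the parser reaches the end of a line (the final '|' of a record), it calls a function that adds the struct to the result vector. The Boost parser calls these functions instead of storing the values into a vector itself.

Here is the code I wrote for the region.tbl file of the TPC-H benchmark: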

struct region{
    int r_regionkey;
    char r_name[25];
    char r_comment[152];
};

class regionStorage{
public:
regionStorage(vector<region>* regions) :regions(regions), pos(0) {}
void storer_regionkey(int const&i){
    currentregion.r_regionkey = i;
}

void storer_name(char const&i){
    currentregion.r_name[pos] = i;
    pos++;
}

void storer_comment(char const&i){
    currentregion.r_comment[pos] = i;
    pos++;
}

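void resetPos() {
    // Terminate r_name (if there is room) before the next field reuses pos.
    if (pos < static_cast<int>(sizeof(currentregion.r_name)))
        currentregion.r_name[pos] = '\0';
    pos = 0;
}

void endOfLine() {
    // Terminate r_comment (if there is room), then store the finished record.
    if (pos < static_cast<int>(sizeof(currentregion.r_comment)))
        currentregion.r_comment[pos] = '\0';
    pos = 0;
    regions->push_back(currentregion);
}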

private:
vector<region>* regions;
region currentregion;
int pos;
};


void parseRegion(){

    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer, /*< start iterator >*/    
     state->dataEndPointer, /*< end iterator >*/
     (*(lexeme[
     +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
     +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
     +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
    ])), space);

   cout << regions.size() << endl;
}

It is not a pretty solution, but it works and it is much faster (2.2 seconds for 1 GB of TPC-H data, multithreaded).

Answer 2

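The problem mainly comes from appending individual char elements to the std::string container. According to your grammar, for each std::string attribute, allocation starts when a char is met and stops when the | separator is found. So at first there are only sizeof(char)+1 bytes reserved (the character plus the terminating "\0"). At run time, std::string then has to grow its buffer according to its doubling algorithm, which means memory is reallocated very frequently for small strings: the string is copied into an allocation twice the size and the previous allocation is freed, at sizes such as 1, 2, 4, 8, 16, 32... characters (the exact growth policy is implementation-defined).

No wonder it is slow. The frequent malloc calls cause big problems: more heap fragmentation, a longer linked list of free memory blocks, and variable (small) block sizes, which in turn lengthens the scans for every allocation the application makes over its whole lifetime. tl;dr: the data becomes fragmented and widely dispersed in memory.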
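
The reallocation pattern described here can be observed with a small standard-library sketch. This is not Spirit code; count_capacity_changes and its parameters are hypothetical names introduced for illustration. It appends characters one at a time with the same insert(end(), value) call that Spirit's push_back_container uses, and counts how often the string's capacity changes, with and without a prior reserve().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: appends n characters one at a time and returns how
// many times the string's capacity changed (each change is a reallocation).
// A reserve_hint of 0 means "do not reserve anything up front".
std::size_t count_capacity_changes(std::size_t n, std::size_t reserve_hint) {
    std::string s;
    if (reserve_hint > 0) {
        s.reserve(reserve_hint);
        // reserve() guarantees at least the requested capacity.
        assert(s.capacity() >= reserve_hint);
    }
    std::size_t changes = 0;
    std::size_t cap = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s.insert(s.end(), 'x');  // per-character append, as in the parser
        if (s.capacity() != cap) {
            cap = s.capacity();
            ++changes;
        }
    }
    return changes;
}
```

With a reserve hint that covers the whole field, the count is zero. Without one, the count depends on the standard library's growth policy and on any small-string buffer it uses, so no specific number is claimed here.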

Proof? The following code is called by char_parser every time a valid char is met in your iterator. From Boost 1.54:

/boost/spirit/home/qi/char/char_parser.hpp

if (first != last && this->derived().test(*first, context))
{
    spirit::traits::assign_to(*first, attr_);
    ++first;
    return true;
}
return false;

/boost/spirit/home/qi/detail/assign_to.hpp

// T is not a container and not a string
template <typename T_>
static void call(T_ const& val, Attribute& attr, mpl::false_, mpl::false_)
{
    traits::push_back(attr, val);
}

/boost/spirit/home/support/container.hpp

template <typename Container, typename T, typename Enable/* = void*/>
struct push_back_container
{
    static bool call(Container& c, T const& val)
    {
        c.insert(c.end(), val);
        return true;
    }
};

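The corrected follow-up code you posted (changing your struct to use char Name[Size]) is basically the same as adding a Name.reserve(Size) call for the string. However, Spirit currently has no directive for this.

Solution: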

/boost/spirit/home/support/container.hpp

template <typename Container, typename T, typename Enable/* = void*/>
struct push_back_container
{
    static bool call(Container& c, T const& val, size_t initial_size = 8)
    {
        if (c.capacity() < initial_size)
            c.reserve(initial_size);
        c.insert(c.end(), val);
        return true;
    }
};

/boost/spirit/home/qi/char/char_parser.hpp

if (first != last && this->derived().test(*first, context))
{
    spirit::traits::assign_to(*first, attr_);
    ++first;
    return true;
}
if (traits::is_container<Attribute>::value == true)
    attr_.shrink_to_fit();
return false;

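I have not tested it, but I assume it can speed up char parsers over string attributes by more than 10x, as you saw. It would be a great optimization feature for a Boost Spirit update, together with a reserve(initial_size)[ +( char_ - lit("|") ) ] directive that sets the initial buffer size.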
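
Outside of Spirit, the effect of such a reserve directive can be approximated by locating the delimiter first and copying each field in a single call. The following is a minimal standard-library sketch under that assumption; split_record is a hypothetical name, and it is not part of Boost or of the code in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits one '|'-delimited record into fields.
// Each field is located first and then constructed from the whole range,
// so its buffer is sized once instead of growing character by character.
std::vector<std::string> split_record(const char* first, const char* last) {
    assert(first <= last);  // both pointers must refer to the same buffer
    std::vector<std::string> fields;
    while (first != last) {
        const char* bar = std::find(first, last, '|');
        fields.emplace_back(first, bar);  // range constructor: one sized copy
        if (bar == last) break;           // trailing field without a '|'
        first = bar + 1;                  // skip the delimiter
    }
    return fields;
}
```

For a region.tbl-style line such as "0|AFRICA|some comment|", this yields the three fields "0", "AFRICA", and "some comment". Numeric fields would still need a separate conversion step, which this sketch leaves out.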

Answer 3

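Are you compiling with -O2?

The Boost libraries contain a lot of redundancy that is removed when optimization flags are used.

Another possible solution is to use the repeat parser directive: http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/qi/reference/directive/repeat.html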

3 answers

Answer 1: 5, accepted, 2012-11-21 08:51:29
Answer 2: 4, 2013-09-04 18:41:20
Answer 3: 0, 2012-11-12 14:36:31
