
C++ backward regex search

I need to build an ultra-efficient log parser (~1 GB/s). I integrated Intel's Hyperscan library (https://www.hyperscan.io), and it works well to:

  • count the occurrences of specified events
  • give the end position of the matches

One limitation is that capture groups cannot be reported, only end offsets. For most matches I only use the count, but for 10% of them the match must be parsed to compute further statistics.

The challenge is to efficiently run a regex that recovers the Hyperscan match when only its end offset is known. So far I have tried:

string data(const string * block) const {
   std::regex nlexpr("\n(.*)\n$");
   std::smatch match;
   std::regex_search((*block).begin(), (*block).begin() + end, match, nlexpr);
   return match[1];
}
  • block points to the file loaded in memory (2 GB, so no copy is possible).
  • end is the known end offset of the match.

But it is extremely inefficient when the string to match is far into the block. I would have expected the "$" to make the operation very quick, since the offset is given as the end position, but it definitely does not. The operation takes ~1 s if end = 100000000.

It is possible to get the start of the matches from Hyperscan, but the performance impact is very high (throughput was roughly halved in my tests), so that is not an option.

Any idea how to achieve this? I am using C++11 (so std::regex, which is modeled on Boost.Regex).

Best regards

Edit: As the question came up in the comments, I do not have any control over the regexes to be used.

I do not have enough reputation to comment XD. I don't see the following as an answer, it's more of an alternative; nevertheless I have to post it as an answer, otherwise I won't reach you.

I guess you won't find a trick that makes performance independent of the position (I guess it scales linearly for such a simple regex, or whatever).

A very simple solution is to replace this horrible regex lib with e.g. the POSIX regex.h (old but gold ;) or Boost.Regex.

Here is an example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <regex.h>
#include <chrono>
#include <boost/regex.hpp>
// Timing helpers; note that inline variables require C++17.
inline auto now = std::chrono::steady_clock::now;
inline auto toMs = [](auto &&x){
    return std::chrono::duration_cast<std::chrono::milliseconds>(x).count();
};

void cregex(std::string const&s, std::string const&p)
{
    auto start = now();
    regex_t r;
    regcomp(&r,p.data(),REG_EXTENDED);
    std::vector<regmatch_t> m(r.re_nsub+1);
    regexec(&r,s.data(),m.size(),m.data(),0);
    regfree(&r);
    std::cout << toMs(now()-start) << "ms " << std::string{s.cbegin()+m[1].rm_so,s.cbegin()+m[1].rm_eo} << std::endl;
}

void cxxregex(std::string const&s, std::string const&p)
{
    using namespace std;
    auto start = now();
    regex r(p.data(),regex::extended);
    smatch m;
    regex_search(s.begin(),s.end(),m,r);
    std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}
void boostregex(std::string const&s, std::string const&p)
{
    using namespace boost;
    auto start = now();
    regex r(p.data(),regex::extended);
    smatch m;
    regex_search(s.begin(),s.end(),m,r);
    std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}

int main()
{
    std::string s(100000000,'x');
    std::string s1 = "yolo" + s;
    std::string s2 = s + "yolo";
    std::cout << "yolo + ... -> cregex "; cregex(s1,"^(yolo)");
    std::cout << "yolo + ... -> cxxregex "; cxxregex(s1,"^(yolo)");
    std::cout << "yolo + ... -> boostregex "; boostregex(s1,"^(yolo)");
    std::cout << "... + yolo -> cregex "; cregex(s2,"(yolo)$");
    std::cout << "... + yolo -> cxxregex "; cxxregex(s2,"(yolo)$");
    std::cout << "... + yolo -> boostregex "; boostregex(s2,"(yolo)$");
}

Gives:

yolo + ... -> cregex 5ms yolo
yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 0ms yolo
... + yolo -> cregex 69ms yolo
... + yolo -> cxxregex 2594ms yolo
... + yolo -> boostregex 62ms yolo
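Since the timings above grow with the amount of text the engine has to scan, one way to bound the cost is to shrink the searched range before calling the regex at all. The following is only a minimal sketch, under the assumption (suggested by the "\n(.*)\n$" pattern in the question, but not guaranteed for arbitrary patterns) that a match never spans a newline. The helper name capture_before is hypothetical, and std::regex is used only to keep the sketch self-contained; any regex library could be substituted.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: runs `re` only on the single line that ends at offset
// `end` (exclusive) and returns capture group 1, or "" if nothing matches.
// Assumption: the match of interest never spans a newline.
std::string capture_before(const std::string& block, std::size_t end,
                           const std::regex& re) {
    assert(end <= block.size());
    std::size_t stop = end;
    // If the end offset sits just past a terminating '\n', drop that newline.
    if (stop > 0 && block[stop - 1] == '\n') --stop;
    // Scan backward for the previous newline; the cost depends on the line
    // length, not on how far `end` is from the start of the block.
    const std::size_t nl =
        (stop == 0) ? std::string::npos : block.rfind('\n', stop - 1);
    const std::size_t start = (nl == std::string::npos) ? 0 : nl + 1;

    std::smatch m;
    const auto first = block.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = block.begin() + static_cast<std::ptrdiff_t>(stop);
    if (!std::regex_search(first, last, m, re)) return {};
    return m.size() > 1 ? m[1].str() : m[0].str();
}
```

Called as capture_before(*block, end, re) with the end offset reported by Hyperscan, this copies nothing but the captured text. If the patterns can span several lines, the backward scan would need a different stopping rule, for example a fixed maximum window size.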

I just realized...

...that my solution proposed below does not work. Well, at least not if there are multiple "yolo" in the text. It does not return the "first instance found in the string", but the "first instance found in a substring of the string". If you have 4 CPUs, the string is split into 4 substrings, and the first thread to return "yolo" 'wins'. This might be OK if you only want to know whether "yolo" occurs anywhere in the text, but not if you want the position of the first instance.
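One possible way around this problem, sketched here under stated assumptions rather than as a tested fix, is to let every thread finish and then take the result from the lowest-numbered chunk that found something. This gives up the early exit. The helper name first_match_parallel is hypothetical, std::regex stands in for Boost.Regex to keep the sketch self-contained, and two caveats remain: a match that straddles a chunk boundary is still missed, and anchors such as ^ and $ are evaluated relative to each chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: offset of the leftmost match of `re` in `s`, or
// std::string::npos. Caveat: matches straddling a chunk boundary are missed.
std::size_t first_match_parallel(const std::string& s, const std::regex& re,
                                 std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t npos = std::string::npos;
    // One result slot per chunk; each thread writes only its own slot,
    // so no mutex is needed.
    std::vector<std::size_t> found(chunks, npos);
    std::vector<std::thread> workers;
    workers.reserve(chunks);
    const std::size_t per_chunk = s.size() / chunks;

    for (std::size_t i = 0; i < chunks; ++i) {
        const std::size_t lo = i * per_chunk;
        // The last chunk also takes the remainder of the string.
        const std::size_t hi = (i + 1 == chunks) ? s.size() : lo + per_chunk;
        workers.emplace_back([&s, &re, &found, i, lo, hi] {
            std::smatch m;
            const auto first = s.begin() + static_cast<std::ptrdiff_t>(lo);
            const auto last = s.begin() + static_cast<std::ptrdiff_t>(hi);
            if (std::regex_search(first, last, m, re))
                found[i] = lo + static_cast<std::size_t>(m.position(0));
        });
    }
    for (auto& w : workers) w.join();

    // Chunks are ordered left to right, so the first hit is the leftmost one.
    for (std::size_t pos : found)
        if (pos != npos) return pos;
    return npos;
}
```

Because all threads run to completion, the worst case costs about as much as scanning the whole string once, spread across the threads.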

Old answer

Building on OZ's answer, I've written a parallel version. Edit: now using a condition variable to finish early.

#include <mutex>
#include <condition_variable>
std::mutex g_mtx;
std::condition_variable g_cv;
int g_found_at = -1;

void thread(
    int id,
    std::string::const_iterator begin,
    std::string::const_iterator end,
    const boost::regex& r,
    boost::smatch* const m)
{
    boost::smatch m_i;
    if (regex_search(begin, end, m_i, r))
    {
        *m = m_i;
        std::unique_lock<std::mutex> lk(g_mtx);
        g_found_at = id;
        lk.unlock();
        g_cv.notify_one();
    }
}
#include <thread>
#include <vector>
#include <memory>
#include <algorithm>
#include <iterator>
#include <chrono>
using namespace std::chrono_literals;
void boostparregex(std::string const &s, std::string const &p)
{
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_found_at = -1;
    }
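    // hardware_concurrency() may return 0 or 1; keep at least one worker to avoid dividing by zero below.
    auto nrOfCpus = std::max(1u, std::thread::hardware_concurrency() / 2);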
    std::cout << "(Nr of CPUs: " << nrOfCpus << ") ";
    auto start = std::chrono::steady_clock::now();
    boost::regex r(p.data(), boost::regex::extended);
    std::vector<std::shared_ptr<boost::smatch>> m; m.reserve(nrOfCpus);
    std::generate_n(std::back_inserter(m), nrOfCpus, []() { return std::make_shared<boost::smatch>(); });
    std::vector<std::thread> t; t.reserve(nrOfCpus);
    auto sizePerThread = s.length() / nrOfCpus;
    for (size_t tId = 0; tId < nrOfCpus; tId++) {
        auto begin = s.begin() + (tId * sizePerThread);
        // The end iterator is exclusive, so no "- 1" is needed; matches spanning two chunks are still missed.
        auto end = tId == nrOfCpus - 1 ? s.end() : s.begin() + ((tId + 1) * sizePerThread);
        t.push_back(std::thread(thread, (int)tId, begin, end, r, m[tId].get()));
    }
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait_for(lk, 10s, []() { return g_found_at >= 0; });
    }
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        if (g_found_at < 0) std::cout << "Not found! "; else std::cout << m[g_found_at]->str() << " ";
    }
    std::cout << toMs(std::chrono::steady_clock::now() - start) << "ms " << std::endl;
    for (auto& thr : t) thr.join();
}

Which gives me this output (I don't have POSIX regex under VS2017):

yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 1ms yolo
yolo + ... -> boostparregex (Nr of CPUs: 4) yolo 13ms
... + yolo -> cxxregex 5014ms yolo
... + yolo -> boostregex 837ms yolo
... + yolo -> boostparregex (Nr of CPUs: 4) yolo 222ms

I get up to a 4x speedup on 4 CPUs. There is some overhead for starting up the threads.

P.S. This is my first C++ thread program and my first regex, so some optimizations may be possible.
