
C++ backward regex search

I need to build a very fast log parser (~1 GB/s). I integrated Intel's Hyperscan library ( https://www.hyperscan.io ), and it works well to:

  • count the number of occurrences of specified events
  • report the end position of each match

One limitation is that Hyperscan cannot report capture groups, only end offsets. For most matches I only need the count, but for about 10% of them the matched text must be parsed to compute further statistics.

The challenge is to efficiently run a regex that recovers the text of a Hyperscan match, knowing only its end offset. So far I have tried:

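// `end` (not a parameter here) is the known end offset of the Hyperscan match.
string data(const string * block) const {
   std::regex nlexpr("\n(.*)\n$");  // the line that finishes at `end`
   std::smatch match;
   // Search only [begin, begin + end) so that "$" anchors at the known offset.
   std::regex_search(block->begin(), block->begin() + end, match, nlexpr);
   return match[1];
}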
  • block points to the file loaded in memory (2GB, so no copy possible).
  • end is the known offset matching the regex.

But it is extremely inefficient when the string to match is far into the block. I expected the "$" to make the operation very quick, since the known offset is given as the end position, but it definitely does not. The operation takes ~1s if end = 100000000 .

It is possible to get the start of each match from Hyperscan, but the performance impact is very high (throughput was roughly halved in my tests), so that is not an option.

Any idea how to achieve this? I am using C++11 (so std::regex, which is based on Boost.Regex, is available).

Best regards

Edit: As the question came up in the comments, I do not have any control over the regexes to be used.
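
For the pattern shown above, one way to keep the cost independent of end is to shrink the search window before calling regex_search. The following is only a sketch under a loud assumption: the match must lie entirely within the last line before end, which holds for "\n(.*)\n$" but may not hold for arbitrary patterns. capture_before is a hypothetical helper name, not something from Hyperscan or the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not from the question): returns capture group 1 of `re`
// for a match ending at offset `end`, searching only the last line before `end`.
// Assumption: the match starts no earlier than the newline preceding that line.
std::string capture_before(const std::string& block, std::size_t end,
                           const std::regex& re) {
    assert(end >= 1 && end <= block.size());
    const std::size_t last = end - 1;  // last character of the match
    // Walk backwards to the previous '\n'; this costs O(line length), not O(end).
    const std::size_t prev =
        (last == 0) ? std::string::npos : block.rfind('\n', last - 1);
    const std::size_t first = (prev == std::string::npos) ? 0 : prev;

    using diff = std::string::difference_type;
    std::smatch m;
    const bool found = std::regex_search(block.begin() + static_cast<diff>(first),
                                         block.begin() + static_cast<diff>(end),
                                         m, re);
    return found ? m[1].str() : std::string();
}
```

The block is never copied: only two iterators into it are passed to regex_search, so the regex engine sees a window of one line instead of everything from the start of the file. If the patterns can span several lines, the same idea would need a different (assumed) upper bound on the match length.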

I do not have enough reputation to comment XD. I don't see the following as an answer; it is more of an alternative. Nevertheless I have to post it as an answer, otherwise I can't reach you.

I guess you won't find a trick to make performance independent of the position (I guess the search is linear in the offset, even for such a simple regex).

A very simple solution is to replace this horrible regex library with e.g. the POSIX regex.h (old but gold ;) or Boost.Regex.

Here is an example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <regex.h>
#include <chrono>
#include <boost/regex.hpp>
// Timing helpers (inline variables need C++17).
inline auto now = std::chrono::steady_clock::now;
inline auto toMs = [](auto &&x){
    return std::chrono::duration_cast<std::chrono::milliseconds>(x).count();
};

void cregex(std::string const&s, std::string const&p)
{
    auto start = now();
    regex_t r;
    regcomp(&r,p.data(),REG_EXTENDED);
    std::vector<regmatch_t> m(r.re_nsub+1);
    regexec(&r,s.data(),m.size(),m.data(),0);
    regfree(&r);
    std::cout << toMs(now()-start) << "ms " << std::string{s.cbegin()+m[1].rm_so,s.cbegin()+m[1].rm_eo} << std::endl;
}

void cxxregex(std::string const&s, std::string const&p)
{
    using namespace std;
    auto start = now();
    regex r(p.data(),regex::extended);
    smatch m;
    regex_search(s.begin(),s.end(),m,r);
    std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}
void boostregex(std::string const&s, std::string const&p)
{
    using namespace boost;
    auto start = now();
    regex r(p.data(),regex::extended);
    smatch m;
    regex_search(s.begin(),s.end(),m,r);
    std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}

int main()
{
    std::string s(100000000,'x');
    std::string s1 = "yolo" + s;
    std::string s2 = s + "yolo";
    std::cout << "yolo + ... -> cregex "; cregex(s1,"^(yolo)");
    std::cout << "yolo + ... -> cxxregex "; cxxregex(s1,"^(yolo)");
    std::cout << "yolo + ... -> boostregex "; boostregex(s1,"^(yolo)");
    std::cout << "... + yolo -> cregex "; cregex(s2,"(yolo)$");
    std::cout << "... + yolo -> cxxregex "; cxxregex(s2,"(yolo)$");
    std::cout << "... + yolo -> boostregex "; boostregex(s2,"(yolo)$");
}

Gives:

yolo + ... -> cregex 5ms yolo
yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 0ms yolo
... + yolo -> cregex 69ms yolo
... + yolo -> cxxregex 2594ms yolo
... + yolo -> boostregex 62ms yolo

I just realized...

My solution proposed below does not work, at least when there are multiple "yolo" in the text. It does not return the first instance found in the string, but the first instance found in any substring of the string. If you have 4 CPUs, the string is split into 4 substrings, and the first thread to find "yolo" 'wins'. This might be OK if you only want to know whether "yolo" is anywhere in the text, but not if you want the position of the first instance.

Old answer

Building on OZ's answer, I've written a parallel version. Edit: it now uses a condition variable to finish early.

#include <mutex>
#include <condition_variable>
std::mutex g_mtx;
std::condition_variable g_cv;
int g_found_at = -1;

void thread(
    int id,
    std::string::const_iterator begin,
    std::string::const_iterator end,
    const boost::regex& r,
    boost::smatch* const m)
{
    boost::smatch m_i;
    if (regex_search(begin, end, m_i, r))
    {
        *m = m_i;
        std::unique_lock<std::mutex> lk(g_mtx);
        g_found_at = id;
        lk.unlock();
        g_cv.notify_one();
    }
}
#include <thread>
#include <vector>
#include <memory>
#include <algorithm>
#include <chrono>
using namespace std::chrono_literals;
void boostparregex(std::string const &s, std::string const &p)
{
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_found_at = -1;
    }
    // hardware_concurrency() may return 0 or 1; never use fewer than one thread.
    auto nrOfCpus = std::max(1u, std::thread::hardware_concurrency() / 2);
    std::cout << "(Nr of CPUs: " << nrOfCpus << ") ";
    auto start = std::chrono::steady_clock::now();
    boost::regex r(p.data(), boost::regex::extended);
    std::vector<std::shared_ptr<boost::smatch>> m; m.reserve(nrOfCpus);
    std::generate_n(std::back_inserter(m), nrOfCpus, []() { return std::make_shared<boost::smatch>(); });
    std::vector<std::thread> t; t.reserve(nrOfCpus);
    auto sizePerThread = s.length() / nrOfCpus;
    for (size_t tId = 0; tId < nrOfCpus; tId++) {
        auto begin = s.begin() + (tId * sizePerThread);
        // `end` is exclusive, so no "- 1": otherwise one character per chunk is never searched.
        auto end = tId == nrOfCpus - 1 ? s.end() : s.begin() + ((tId + 1) * sizePerThread);
        t.push_back(std::thread(thread, (int)tId, begin, end, r, m[tId].get()));
    }
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait_for(lk, 10s, []() { return g_found_at >= 0; });
    }
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        if (g_found_at < 0) std::cout << "Not found! "; else std::cout << m[g_found_at]->str() << " ";
    }
    std::cout << toMs(std::chrono::steady_clock::now() - start) << "ms " << std::endl;
    for (auto& thr : t) thr.join();
}

Which gives me this output (I don't have POSIX regex under VS2017):

yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 1ms yolo
yolo + ... -> boostparregex (Nr of CPUs: 4) yolo 13ms
... + yolo -> cxxregex 5014ms yolo
... + yolo -> boostregex 837ms yolo
... + yolo -> boostparregex (Nr of CPUs: 4) yolo 222ms

I get up to a 4x speedup on 4 CPUs. There is some overhead for starting up the threads.

PS: This is my first C++ thread program and my first regex, so some optimizations may be possible.
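
A possible way around the "first thread wins" problem described at the top of this answer is to let every chunk finish and keep the smallest match offset. The sketch below is not the code that was measured above: it swaps in std::regex and std::async so that it compiles without Boost, and first_match_parallel, chunks and overlap are hypothetical names. It assumes no match is longer than overlap characters and that the pattern has no anchors or lookaround depending on text outside a chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: offset of the leftmost match of `re` in `s`, or npos.
// Assumptions: no match is longer than `overlap` characters, and the pattern
// has no anchors or lookaround that depend on text outside a chunk.
std::size_t first_match_parallel(const std::string& s, const std::regex& re,
                                 std::size_t chunks, std::size_t overlap) {
    if (chunks == 0) chunks = 1;  // hardware_concurrency() may report 0
    const std::size_t per = s.size() / chunks;
    using diff = std::string::difference_type;

    std::vector<std::future<std::size_t>> results;
    results.reserve(chunks);
    for (std::size_t i = 0; i < chunks; ++i) {
        const std::size_t lo = i * per;
        // Each window runs past its chunk so boundary-straddling matches are seen.
        const std::size_t hi = (i + 1 == chunks)
                                   ? s.size()
                                   : std::min(s.size(), (i + 1) * per + overlap);
        results.push_back(std::async(std::launch::async, [&s, &re, lo, hi] {
            std::smatch m;
            const bool found = std::regex_search(
                s.begin() + static_cast<diff>(lo),
                s.begin() + static_cast<diff>(hi), m, re);
            return found ? lo + static_cast<std::size_t>(m.position(0))
                         : std::string::npos;
        }));
    }

    // Wait for every chunk and keep the smallest offset, so the result does
    // not depend on which thread finishes first.
    std::size_t best = std::string::npos;
    for (auto& r : results) best = std::min(best, r.get());
    return best;
}
```

The trade-off is that the early exit is lost: the call returns only after the slowest chunk is done. Overlapping the windows also covers matches that straddle a chunk boundary, which the split in the version above would miss.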
