
Are regular expressions adequate for analyzing large text?

I have a text of about 1,000,000 words (just an example) and I want to process it in a single pass. Every time an expression matches, I want to replace the match with something else. For example:

The expressions:

| Expression     | New text |
|----------------|----------|
| ab             | A        |
| #{1,3}cb       | B        |
| [1234567890]db | C        |

And the text:

Lorem ipsum dolor sit amet, ##cb consectetur adipiscing elit. Vestibulum posuere ligula in diam volutpat #cb , vitae laoreet ligula ultrices. In quis aliquam urna, in ab suscipit purus. Nunc posuere efficitur 9db nibh, tempor convallis mauris porta lobortis.

Will output the text:

Lorem ipsum dolor sit amet, B consectetur adipiscing elit. Vestibulum posuere ligula in diam volutpat B , vitae laoreet ligula ultrices. In quis aliquam urna, in A suscipit purus. Nunc posuere efficitur C nibh, tempor convallis mauris porta lobortis.

Assuming those "regular" expressions won't overlap with each other, is there a way to implement this in C++ using std::regex?

I have tried to do so, but there are two main problems:

  • It only matches the very first and the very last expression.
  • It iterates over the whole text once for each expression.

Is there any way to solve this, or is the best solution to write a parser for this specific purpose?

As EJP says, this is trivial using (f)lex, even if you don't know anything about parser generators. (You should take a glance at flex's slightly idiosyncratic regular expression syntax, though. As with every regex-oriented tool, it has its own quirks.)

Here's a complete flex program for your task:

file replacer.l

%option noinput nounput noyywrap
%%
ab         { fputs("A", stdout); }
"#"{1,3}cb { fputs("B", stdout); }
[0-9]db    { fputs("C", stdout); }

The last line could have been [1234567890], but [0-9] and [[:digit:]] are more idiomatic. As illustrated in the second-last line, it's normal to quote special characters by putting them inside quotation marks, something which few regex tools allow. Note that "abc"* is "any number of repetitions of abc", which is quite different from abc*.

The biggest limitation is that there are no captures, although that doesn't seem relevant to your question, and there are almost always simple workarounds.

To compile and run that:

$ flex -o replacer.c replacer.l
$ gcc -Wall -o replacer replacer.c -lfl

$ ./replacer < lorem
Lorem ipsum dolor sit amet, Bconsectetur adipiscing elit. Vestibulum
posuere ligula in diam volutpat B, vitae laoreet ligula ultrices. In
quis aliquam urna, in A suscipit purus. Nunc posuere efficitur C nibh,
tempor convallis mauris porta lobortis.

I've been known to build quick-and-dirty tools which generate, compile and run flex programs, because the result can be a lot faster than cobbling things together from standard utilities when the input files are large. See, for example, this answer on a companion site, which adds an actual main to the tiny flex program, in order to avoid relying on -lfl.

I don't know how to do this with std::regex, but you could try boost::iostreams::regex_filter. The example from their site only uses one filter, but I'm sure you could tack on more. The example:

#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/regex.hpp>
#include <boost/regex.hpp>
#include <iostream>

using namespace boost::iostreams;

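int main()
{
  // Fixed-size destination buffer that receives the filtered output.
  char buffer[16];
  array_sink sink{buffer};
  filtering_ostream os;
  // Replace every match of "Bo+st" with "C++" as data passes through.
  os.push(regex_filter{boost::regex{"Bo+st"}, "C++"});
  // push more filters here
  os.push(sink);
  os << "Boost" << std::flush;
  // Popping the sink flushes the filter chain into the buffer.
  os.pop();
  // "C++" is three characters long.
  std::cout.write(buffer, 3);
}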

The caveat is that, according to the docs, it will read all of your text into memory at once. Your 1 million words shouldn't be a problem, but for gigabyte-sized files you could run into issues.
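Chaining several filters amounts to one replacement pass per expression. A minimal sketch of that same idea using only the standard library (not Boost) is shown here; `replace_sequentially` is a hypothetical helper name, and the rules are assumed to be (pattern, replacement) pairs in the default ECMAScript grammar:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: apply each (pattern, replacement) rule in turn.
// Every rule makes its own full pass over the text, so n rules mean n passes.
std::string replace_sequentially(
    std::string text,
    const std::vector<std::pair<std::string, std::string>>& rules)
{
    for (const auto& rule : rules) {
        // std::regex_replace replaces all non-overlapping matches of one pattern.
        // Note: '$' in the replacement is interpreted as a format specifier.
        text = std::regex_replace(text, std::regex(rule.first), rule.second);
    }
    return text;
}
```

Besides the repeated passes, a later rule can also match text produced by an earlier replacement, which is harmless for the rules in the question but worth keeping in mind in general.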

If it were me I would just do the replacement by hand since it's probably not that complex.

Here is what you can do.

Let r1, r2, ..., rn be your regular expressions, and r be their alternation (r1)|(r2)|...|(rn).

  1. Find the first occurrence of r.
  2. Compare the occurrence separately with r1, r2, ..., rn. When one of them matches, write to the destination text everything before the occurrence, followed by the corresponding replacement.
  3. Repeat with the rest of the text.
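The steps above can be sketched with std::regex roughly as follows. This is one possible reading, not a definitive implementation: `replace_all` is a hypothetical name, and instead of re-testing the occurrence against each ri it checks which capture group of the combined expression took part in the match. That shortcut assumes the individual patterns contain no capturing groups of their own.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: replace matches of any rule in one left-to-right pass.
// Assumption: no pattern contains its own capturing group, so group i+1 of
// the combined regex corresponds to rules[i].
std::string replace_all(
    const std::string& text,
    const std::vector<std::pair<std::string, std::string>>& rules)
{
    // Build the alternation (r1)|(r2)|...|(rn).
    std::string combined;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (i != 0) combined += '|';
        combined += '(' + rules[i].first + ')';
    }
    const std::regex re(combined);

    std::string out;
    auto pos = text.cbegin();  // end of the previous match
    for (std::sregex_iterator it(text.cbegin(), text.cend(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(pos, m[0].first);  // copy the text before this occurrence
        for (std::size_t i = 0; i < rules.size(); ++i) {
            if (m[i + 1].matched) {   // this alternative produced the match
                out += rules[i].second;  // appended literally, no '$' expansion
                break;
            }
        }
        pos = m[0].second;
    }
    out.append(pos, text.cend());  // copy whatever follows the last match
    return out;
}
```

With the three rules from the question, `replace_all("in ab suscipit 9db nibh", rules)` yields `"in A suscipit C nibh"`. In the default ECMAScript grammar the first alternative that matches wins rather than the longest one, which makes no difference as long as the expressions do not overlap.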

