I have a text of about 1,000,000 words (just an example) and I want to process it in a single pass. Every time an expression matches, I want to replace it with something else. For example:
The expressions:
| Expression | New text |
|----------------|----------|
| ab | A |
| #{1,3}cb | B |
| [1234567890]db | C |
And the text:
Lorem ipsum dolor sit amet, ##cb consectetur adipiscing elit. Vestibulum posuere ligula in diam volutpat #cb , vitae laoreet ligula ultrices. In quis aliquam urna, in ab suscipit purus. Nunc posuere efficitur 9db nibh, tempor convallis mauris porta lobortis.
Will output the text:
Lorem ipsum dolor sit amet, B consectetur adipiscing elit. Vestibulum posuere ligula in diam volutpat B , vitae laoreet ligula ultrices. In quis aliquam urna, in A suscipit purus. Nunc posuere efficitur C nibh, tempor convallis mauris porta lobortis.
Assuming those "regular" expressions won't overlap with each other, is there a way to implement this in C++ using `std::regex`?
I have tried to do so, but there are two main problems:
Is there any way to solve this, or is the best solution to write a parser for this specific purpose?
As EJP says, this is trivial using (f)lex, even if you don't know anything about parser generators. (You should take a glance at flex's slightly idiosyncratic regular expression syntax, though. As with every regex-oriented tool, it has its own quirks.)
Here's a complete flex program for your task:
%option noinput nounput noyywrap
%%
ab { fputs("A", stdout); }
"#"{1,3}cb { fputs("B", stdout); }
[0-9]db { fputs("C", stdout); }
The last line could have been `[1234567890]`, but `[0-9]` and `[[:digit:]]` are more idiomatic. As illustrated in the second-last line, it's normal to quote special characters by putting them inside quotation marks, something which few regex tools allow. Note that `"abc"*` is "any number of repetitions of `abc`", which is quite different from `abc*`.
The biggest limitation is that there are no captures, although that doesn't seem relevant to your question, and there are almost always simple workarounds.
To compile and run that:
$ flex -o replacer.c replacer.l
$ gcc -Wall -o replacer replacer.c -lfl
$ ./replacer < lorem
Lorem ipsum dolor sit amet, Bconsectetur adipiscing elit. Vestibulum
posuere ligula in diam volutpat B, vitae laoreet ligula ultrices. In
quis aliquam urna, in A suscipit purus. Nunc posuere efficitur C nibh,
tempor convallis mauris porta lobortis.
I've been known to build quick-and-dirty tools which generate, compile and run flex programs, because the result can be a lot faster than cobbling things together from standard utilities when the input files are large. See, for example, this answer on a companion site, which adds an actual `main` to the tiny flex program, in order to avoid relying on `-lfl`.
I'm not aware of a way to do this with `std::regex`, but you could try `boost::iostreams::regex_filter`. The example from their site only uses one filter, but I'm sure you could tack on more. The example:
    #include <boost/iostreams/device/array.hpp>
    #include <boost/iostreams/filtering_stream.hpp>
    #include <boost/iostreams/filter/regex.hpp>
    #include <boost/regex.hpp>
    #include <iostream>

    using namespace boost::iostreams;

    int main()
    {
        char buffer[16];
        array_sink sink{buffer};      // filtered output ends up in buffer
        filtering_ostream os;
        // Replace every match of Bo+st with the literal text C++.
        os.push(regex_filter{boost::regex{"Bo+st"}, "C++"});
        // push more filters here
        os.push(sink);
        os << "Boost" << std::flush;
        os.pop();                     // pop the sink to flush the chain
        std::cout.write(buffer, 3);   // the three characters of "C++"
    }
The caveat is that, according to the docs, it will read all of your text into memory at once. Your 1 million words shouldn't be a problem, but with gigabyte-sized files you could run into issues.
If it were me, I would just do the replacement by hand, since it's probably not that complex.
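For what it's worth, a hand-written replacement for the three expressions in the question could look roughly like the sketch below. The function name `replace_by_hand` is made up for illustration, and the scanner is hard-wired to these three patterns; a different set of expressions would need different code. It walks the text once and copies anything that doesn't match.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: single left-to-right pass, hard-coded for
//   ab -> A,   #{1,3}cb -> B,   [0-9]db -> C
std::string replace_by_hand(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        const char c = text[i];
        // ab -> A
        if (text.compare(i, 2, "ab") == 0) {
            out += 'A';
            i += 2;
            continue;
        }
        // [0-9]db -> C
        if (c >= '0' && c <= '9' && text.compare(i + 1, 2, "db") == 0) {
            out += 'C';
            i += 3;
            continue;
        }
        // #{1,3}cb -> B
        if (c == '#') {
            std::size_t j = i;
            while (j < n && text[j] == '#') ++j;  // j is one past the run of '#'
            const std::size_t hashes = j - i;
            if (text.compare(j, 2, "cb") == 0) {
                // Only the last three '#' can be part of the match;
                // any earlier ones are ordinary text.
                if (hashes > 3) out.append(hashes - 3, '#');
                out += 'B';
                i = j + 2;
            } else {
                out.append(hashes, '#');  // a run of '#' not followed by cb
                i = j;
            }
            continue;
        }
        out += c;  // ordinary character: copy through
        ++i;
    }
    return out;
}
```

Calling `replace_by_hand` on the sample text from the question should give the expected output shown there. The same loop could read from a stream in chunks instead of a `std::string` if memory is a concern, as long as a match is never split across a chunk boundary.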
Here is what you can do.
Let r1, r2, ..., rn be your regular expressions, and r be their alternation (r1)|(r2)|...|(rn).
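One way to act on that alternation with `std::regex` might look like the sketch below. It assumes each expression is wrapped in its own capture group and contains no capture groups of its own, so that the index of the group that matched identifies which replacement to use. The function name `replace_all` and the hard-coded pattern and replacement table are illustrative only.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: replaces every match of (r1)|(r2)|(r3) in one
// left-to-right pass over the text.
std::string replace_all(const std::string& text) {
    // One capture group per original expression.
    static const std::regex combined("(ab)|(#{1,3}cb)|([0-9]db)");
    // replacements[g - 1] is used when capture group g participated.
    static const std::vector<std::string> replacements = {"A", "B", "C"};

    std::string out;
    auto last = text.cbegin();  // end of the previous match
    for (std::sregex_iterator it(text.cbegin(), text.cend(), combined), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // copy the unmatched text verbatim
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {  // only one alternative matches at a time
                out += replacements[g - 1];
                break;
            }
        }
        last = m[0].second;
    }
    out.append(last, text.cend());  // tail after the final match
    return out;
}
```

Because the iterator only moves forward, the text is scanned once regardless of how many expressions are combined. This version still holds the whole input and output in memory as `std::string`s.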