
Validate ASCII GnuPlot file with C++ regex

I have been trying to get this right, but cannot seem to make things work the way I want them to.

I have an ASCII file containing several million lines of floating point values, separated by spaces. Reading these values is straightforward using std::istream_iterator<double>, but I wanted to validate the file upfront to make sure it is really formatted the way I described. Since there is only one correct format, and gazillions of ways in which it can be ill-formed, I wanted to go about it using std::regex.

This is what I came up with:

std::string begln( "^" );
std::string endln( "$" );
std::string fp( "[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?." );
std::string space( "[[:space:]]{1}" );
std::regex regexp( "(" + begln + fp + space + fp + space + fp + endln + ")+" );

What I wanted to express was: a line is whatever lies between the beginning and the end of the line, it consists of three floating point values separated by a single space each, and I am looking for one or more of these lines.

I would expect a valid datafile to have a single match without prefix and suffix.
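For comparison, a per-line variant of the same idea could look like this. It is a sketch under a few assumptions, not the code from the question: the decimal point is escaped as `\\.` so that it matches only a literal dot, the float pattern has no trailing wildcard, and std::regex_match is applied to one line at a time, which makes the `^` and `$` anchors unnecessary. The helper name is_valid_line is invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `line` consists of exactly three floating
// point values, each pair separated by a single whitespace character.
bool is_valid_line(const std::string& line) {
    // Optional sign, optional integer part, optional dot, at least one
    // digit, optional exponent. A bare "1." is rejected by this pattern.
    static const std::string fp = "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?";
    static const std::string space = "[[:space:]]";
    static const std::regex re(fp + space + fp + space + fp);
    // regex_match succeeds only if the whole line matches the pattern.
    return std::regex_match(line, re);
}
```

Calling this on each line obtained from std::getline would validate the file line by line, so the regex engine never has to match several million lines in one go.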

But hey, since these values will go into a std::vector<std::array<double, 3>>, why don't I reuse the regex machinery and obtain the values from a match list? If the file is valid, then an absolutely trivial regex could match just individual lines, and you could construct a std::sregex_iterator to iterate over the lines. At this point, it is only a matter of taste how one obtains the values from the single std::string of a line, whether using regex again or std::stringstream.
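A rough sketch of that idea, assuming the whole file has already been loaded into a std::string and validated. The function name parse_lines and the line pattern are assumptions made for this example.

```cpp
#include <array>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: walks over the lines of `text` with a trivial regex
// and extracts three doubles from each line with a string stream.
std::vector<std::array<double, 3>> parse_lines(const std::string& text) {
    // One match per non-empty line: a run of characters without a newline.
    static const std::regex line_re("[^\n]+");
    std::vector<std::array<double, 3>> rows;
    for (std::sregex_iterator it(text.begin(), text.end(), line_re), end;
         it != end; ++it) {
        std::istringstream fields(it->str());
        std::array<double, 3> row{};
        // Lines that do not yield three numbers are skipped here; the
        // sketch assumes validation has already happened elsewhere.
        if (fields >> row[0] >> row[1] >> row[2]) {
            rows.push_back(row);
        }
    }
    return rows;
}
```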

Why not? The reason you wouldn't want this is that regexes are absolute overkill. They can match far more complex grammars, and are capable of reading in those grammars at runtime. That flexibility comes at a high price. All the possible parsers must be included. No current compiler is smart enough to see that you just used [[:space:]] as a regex. (In fact, no C++ compiler or linker knows anything about regex - that's purely a library thing.)

In comparison, operator>> is overloaded and the compiler sees exactly which overloads you use at compile time. The linker is told this, and includes just the relevant code.

Furthermore, the CPU branch predictor will soon notice that operator>> almost always succeeds, which is a further speedup. Your regex code is less likely to benefit in the same way - the conditional part in [0-9]* is at least one level of indirection deeper.
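To make the operator>> route concrete, here is a minimal sketch that validates and parses in one pass without any regex. The name read_rows and the exact acceptance rules are assumptions: unlike the single-space pattern in the question, this version tolerates any amount of whitespace between values.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads rows of exactly three doubles from `in`.
// Returns false at the first line that has too few values, a token that
// is not a number, or anything left over after the third value.
bool read_rows(std::istream& in, std::vector<std::array<double, 3>>& rows) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::array<double, 3> row{};
        if (!(fields >> row[0] >> row[1] >> row[2])) {
            return false;  // fewer than three parseable numbers
        }
        std::string extra;
        if (fields >> extra) {
            return false;  // trailing content after the third value
        }
        rows.push_back(row);
    }
    return true;  // every line was well formed
}
```

On failure the vector still holds the rows read so far, so the caller can report the offending line as rows.size() + 1.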
