
Alternatives to Regular Expressions

I have a set of strings with numbers embedded in them. They look something like /cal/long/3/4/145:999 or /pa/metrics/CosmicRay/24:4:bgp:EnergyKurtosis. I'd like to have an expression parser that:

  • Is easy to use. Given a few examples, someone should be able to form a new expression. I want end users to be able to form new expressions to query this set of strings. Some of the potential users are software engineers, others are testers, and some are scientists.
  • Allows for constraints on numbers. Something like '/cal/long/3/4/143:#>100&<1110' to specify that a string with the prefix '/cal/long/3/4/143:' followed by a number in the open interval (100, 1110) is expected.
  • Supports '|' and '*'. So the expression '/cal/(long|short)/3/4/*' would match '/cal/long/3/4/1:2' as well as '/cal/short/3/4/1:2'.
  • Has a Java implementation available, or would be easy to implement in Java.

Interesting alternative ideas would be useful. I'm also entertaining the idea of just implementing the subset of regular expressions that I need plus the numerical constraints.

Thanks!

There's no reason to reinvent the wheel! The core of a regular expression engine is built on a strong foundation of mathematics and computer science; the reason we continue to use regular expressions today is that they are fundamentally sound and won't be improved upon in the foreseeable future.

If you do find or create some alternative parsing language that only covers a subset of what Regex can express, you will quickly have a user asking for a concept that can be expressed in Regex but that your flavor just plain leaves out. Spend your time solving problems that haven't been solved instead!

I'm inclined to agree with Rex M, although your second requirement for numerical constraints complicates things. Unless you only allowed very basic constraints, I'm not aware of a way to succinctly express that in a regular expression. If there is such a way, please disregard the rest of my answer and follow the other suggestions here. :)

You might want to consider a parser generator - things like the classic lex and yacc. I'm not really familiar with the Java choices, but here's a list:

http://java-source.net/open-source/parser-generators

If you're not familiar, the standard approach would be to first create a lexer that turns your strings into tokens. Then you would pass those tokens on to a parser that applies your grammar to them and spits out some kind of result.
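As a rough illustration of the lexing step, here is a minimal hand-written sketch in Java. It assumes the query syntax from the question, where ( ) | * # > < & are special characters and everything else is literal text; the class name QueryLexer and the token kinds are hypothetical, and a parser generator would normally produce this stage for you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lexer for the question's query syntax (not a library API).
final class QueryLexer {
    enum Kind { LITERAL, LPAREN, RPAREN, PIPE, STAR, HASH, GT, LT, AMP }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // Maps a special character to its token kind, or null for literal text.
    private static Kind kindOf(char c) {
        switch (c) {
            case '(': return Kind.LPAREN;
            case ')': return Kind.RPAREN;
            case '|': return Kind.PIPE;
            case '*': return Kind.STAR;
            case '#': return Kind.HASH;
            case '>': return Kind.GT;
            case '<': return Kind.LT;
            case '&': return Kind.AMP;
            default:  return null;
        }
    }

    static List<Token> tokenize(String expr) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < expr.length()) {
            Kind kind = kindOf(expr.charAt(i));
            if (kind != null) {
                // A special character is always a one-character token.
                tokens.add(new Token(kind, String.valueOf(expr.charAt(i))));
                i++;
            } else {
                // Consume the longest run of non-special characters as one literal.
                int start = i;
                while (i < expr.length() && kindOf(expr.charAt(i)) == null) {
                    i++;
                }
                tokens.add(new Token(Kind.LITERAL, expr.substring(start, i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/cal/long/3/4/143:#>100&<1110"));
    }
}
```

For the question's example, this prints [LITERAL(/cal/long/3/4/143:), HASH(#), GT(>), LITERAL(100), AMP(&), LT(<), LITERAL(1110)]. Deciding that the literals after > and < must be numbers is left to the parser.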

In your case, I envision the parser producing a combination of a regular expression and additional conditions. For your numerical constraint example, it might give you the regular expression /cal/long/3/4/143:(\d+) and a constraint to apply to the first group (the \d+ portion) that requires the number to lie between 100 and 1110. You'd then apply the RE to your strings to find candidates, and apply the constraint to those candidates to find your matches.

It's a pretty complicated approach, so hopefully there's a simpler way. I hope that gives you some ideas, at least.
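The second half of that idea (regular expression first, numeric check second) might look roughly like the sketch below. It uses only java.util.regex from the standard library; the class name ConstrainedPattern is hypothetical, the bounds are taken from the question's (100, 1110) example and treated as exclusive, and the parser that would generate this object from a query string is not shown.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: a regex selects candidates, then a numeric
// constraint is applied to the first capture group.
final class ConstrainedPattern {
    private final Pattern pattern;
    private final long lowExclusive;
    private final long highExclusive;

    ConstrainedPattern(String regex, long lowExclusive, long highExclusive) {
        this.pattern = Pattern.compile(regex);
        this.lowExclusive = lowExclusive;
        this.highExclusive = highExclusive;
    }

    boolean matches(String input) {
        Matcher m = pattern.matcher(input);
        if (!m.matches()) {
            return false; // not even a candidate
        }
        try {
            long value = Long.parseLong(m.group(1));
            return value > lowExclusive && value < highExclusive;
        } catch (NumberFormatException e) {
            return false; // digit run too large for a long
        }
    }

    List<String> filter(List<String> inputs) {
        return inputs.stream().filter(this::matches).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ConstrainedPattern p =
            new ConstrainedPattern("/cal/long/3/4/143:(\\d+)", 100, 1110);
        // prints [/cal/long/3/4/143:999]
        System.out.println(p.filter(List.of(
            "/cal/long/3/4/143:999",
            "/cal/long/3/4/143:50",
            "/cal/long/3/4/143:1110",
            "/cal/short/3/4/143:500")));
    }
}
```

Keeping the constraint outside the regex means the pattern stays short and the numeric rule can be arbitrarily complex without any regex tricks.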

The Java constraint is a severe one. I would recommend using parsing combinators, but you will have to translate the ideas to Java using classes instead of functions. There are many, many papers available on this topic; one of the easiest to approach is Graham Hutton's Higher-Order Functions for Parsing. Hutton's approach makes it especially easy to decide to succeed or fail based on conditions like the magnitude of a number, as you show in your example.
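To give a flavor of what that translation could look like, here is a small sketch that uses a Java functional interface in place of higher-order functions. It is only loosely in the spirit of the combinator style, not a rendering of Hutton's paper: all names (Combinators, Parser, then, or, where, literal, number) are made up for this example, and the choice combinator does not backtrack into later sequence steps.

```java
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical mini combinator library; none of these names come from a real API.
final class Combinators {
    // A successful parse: the parsed value and the position just after it.
    static final class Result<T> {
        final T value;
        final int next;
        Result(T value, int next) { this.value = value; this.next = next; }
    }

    @FunctionalInterface
    interface Parser<T> {
        Optional<Result<T>> parse(String in, int pos);

        // Sequence: run this parser, then the other; keep the other's value.
        default <U> Parser<U> then(Parser<U> other) {
            return (in, pos) -> parse(in, pos).flatMap(r -> other.parse(in, r.next));
        }

        // Choice: try this parser, fall back to the other at the same position.
        default Parser<T> or(Parser<T> other) {
            return (in, pos) -> {
                Optional<Result<T>> r = parse(in, pos);
                return r.isPresent() ? r : other.parse(in, pos);
            };
        }

        // Succeed only if the parsed value satisfies the condition.
        default Parser<T> where(Predicate<T> cond) {
            return (in, pos) -> parse(in, pos).filter(r -> cond.test(r.value));
        }
    }

    static Parser<String> literal(String s) {
        return (in, pos) -> {
            if (in.startsWith(s, pos)) {
                return Optional.of(new Result<String>(s, pos + s.length()));
            }
            return Optional.empty();
        };
    }

    static Parser<Long> number() {
        return (in, pos) -> {
            int end = pos;
            while (end < in.length() && in.charAt(end) >= '0' && in.charAt(end) <= '9') {
                end++;
            }
            // Reject empty runs and runs too long to fit safely in a long.
            if (end == pos || end - pos > 18) {
                return Optional.empty();
            }
            return Optional.of(new Result<Long>(Long.parseLong(in.substring(pos, end)), end));
        };
    }

    // True if the parser succeeds and consumes the whole input.
    static <T> boolean matchesAll(Parser<T> p, String in) {
        return p.parse(in, 0).map(r -> r.next == in.length()).orElse(false);
    }

    // The question's two examples combined: (long|short) plus a number in (100, 1110).
    static Parser<Long> example() {
        return literal("/cal/")
            .then(literal("long").or(literal("short")))
            .then(literal("/3/4/143:"))
            .then(number().where(n -> n > 100 && n < 1110));
    }

    public static void main(String[] args) {
        System.out.println(matchesAll(example(), "/cal/short/3/4/143:999")); // true
        System.out.println(matchesAll(example(), "/cal/long/3/4/143:50"));   // false
    }
}
```

The numeric constraint is just a predicate passed to where, so bounds, parity checks, or anything else expressible in Java fit in without extending a grammar.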

Unfortunately, not all programmers (myself included) are as familiar with RegEx as they ought to be. This often means we end up writing our own string-parsing logic where RegEx could otherwise have served us well.

This isn't always bad. It's possible in some cases to write a DSL (a class, a cohesive set of methods) that's more elegant and readable and meets the precise needs of your problem domain. The trouble is that it can take dozens of iterations to distill the problem into a DSL that is simple and intuitive. And only if the DSL will be used far and wide in the application or by a large community is this trouble warranted. Don't write an elegant solution to a problem that only appears sporadically.

If you're going to go the parser route, check out the GOLD Parsing System. It's often a better option than something like YACC, cleaner looking than pure regexes, and supports Java.

http://goldparser.org/about/how-it-works.htm

Actually, what you described is the Java Pattern Matcher, which just happens to use Regex as its language.
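For the alternation and wildcard requirement, a minimal sketch with the standard java.util.regex classes could look like this. It assumes the trailing wildcard in the question means "any remaining characters", spelled .* in regex syntax; the class name PatternDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern and Matcher are the real JDK API.
final class PatternDemo {
    // '(long|short)' is alternation; '.*' stands in for the question's wildcard
    // (an assumption about the intended meaning).
    static final Pattern CAL = Pattern.compile("/cal/(long|short)/3/4/.*");

    static boolean matches(String s) {
        // matches() requires the whole string to match, not just a prefix.
        return CAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/cal/long/3/4/1:2"));   // true
        System.out.println(matches("/cal/short/3/4/1:2"));  // true
        System.out.println(matches("/cal/medium/3/4/1:2")); // false
    }
}
```

This covers the third requirement directly; the numeric range requirement still needs a check outside the pattern.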
