简体   繁体   中英

Alternatives to Regular Expressions

I have a set of strings with numbers embedded in them. They look something like /cal/long/3/4/145:999 or /pa/metrics/CosmicRay/24:4:bgp:EnergyKurtosis. I'd like to have an expression parser that is

  • Easy to use. Given a few examples someone should be able to form a new expression. I want end users to be able to form new expressions to query this set of strings. Some of the potential users are software engineers, others are testers and some are scientists.
  • Allows for constraints on numbers. Something like '/cal/long/3/4/143:#>100&<1110' to specify that a string prefix with '/cal/long/3/4/143:' and then a number between (100,1110) is expected.
  • Supports '|' and . So the expression '/cal/(long|short)/3/4/ ' would match '/cal/long/3/4/1:2' as well as '/cal/short/3/4/1:2'.
  • Has a Java implementation available or would be easy to implement in Java.

Interesting alternative ideas would be useful. I'm also entertaining the idea of just implementing the subset of regular expressions that I need plus the numerical constraints.

Thanks!

There's no reason to reinvent the wheel! The core of a regular expression engine is built on a strong foundation of mathematics and computer science; the reason we continue to use them today is they are principally sound and won't be improved in the foreseeable future.

If you do find or create some alternative parsing language that only covers a subset of the possibilities Regex can, you will quickly have a user asking for a concept that can be expressed in Regex but your flavor just plain leaves out. Spend your time solving problems that haven't been solved instead!

I'm inclined to agree with Rex M, although your second requirement for numerical constraints complicates things. Unless you only allowed very basic constraints, I'm not aware of a way to succinctly express that in a regular expression. If there is such a way, please disregard the rest of my answer and follow the other suggestions here. :)

You might want to consider a parser generator - things like the classic lex and yacc. I'm not really familiar with the Java choices, but here's a list:

http://java-source.net/open-source/parser-generators

If you're not familiar, the standard approach would be to first create a lexer that turns your strings into tokens. Then you would pass those tokens onto a parser that applies your grammar to them and spits out some kind of result.

In your case, I envision the parser resulting in a combination of a regular expression and additional conditions. For your numerical constraint example, it might give you the regular expression \\/cal/long/3/4/143:(\\d+)\\ and a constraint to apply to the first grouping (the \\d+ portion) that requires that the number lie between 100 and 1100. You'd then apply the RE to your strings for candidates, and apply the constraint to those candidates to find your matches.

It's a pretty complicated approach, so hopefully there's a simpler way. I hope that gives you some ideas, at least.

The Java constraint is a severe one. I would recommend using parsing combinators , but you will have to translate the ideas to Java using classes instead of functions. There are many, many papers available on this topic; one of the easiest to approach is Graham Hutton's Higher-Order Functions for Parsing . Hutton's approach makes it especially easy to decide to succeed or fail based on conditions like the magnitude of a number, as you show in your example.

Unfortunately, not all programmers (myself included) are as familiar with RegEx as they ought be. This often means we end up writing our own string-parsing logic where RegEx could otherwise have served us well.

This isn't always bad. It's possible in some cases to write a DSL (a class, a cohesive set of methods) that's more elegant and readable and meets the precise needs of your problem domain. The trouble is that it can take dozens of iterations to distill the problem into a DSL that is simple and intuitive. And only if the DSL will be used far and wide in the application or by a large community is this trouble warranted. Don't write a elegant solution to a problem that only appears sporadically.

If you're going to go the parser route, check out GOLD Parsing System. It's often a better option than something like YACC, cleaner looking than pure regexes, and supports Java.

http://goldparser.org/about/how-it-works.htm

Actually what you described is the Java Pattern Matcher. Which just happens to use Regex as its language.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM