简体   繁体   中英

Regular Expressions and The Pipe Operator

A little regex help please.

Why are these different?

Regex.Replace("(999) 555-0000 /x ext123", "/x.*|[^0-9]", String.Empty)
"9995550000"


Regex.Replace("(999) 555-0000 /x ext123", "[^0-9]|/x.*", String.Empty)
"9995550000123"

I thought the pipe operator did not care about order... or maybe there is something else that can explain this?

If I took a wild guess, I'd say that it's running the first part of the expression first, and then the second part. So, what's happening in the second case is that it's removing all the non-numeric parts, which means that the second part will never match, and leaves you with the extension intact.

Since it has to run some part of the expression first, since it can't run both at the same time, I'd say this is a fairly natural assumption, though I can see why you might get caught... Definitely an interesting gotcha though.

EDIT: To address the wording, as Ben rightly pointed out, the expression is attempted to be matched starting with each character in the string. So, what happens in the second case is:

  • There is no "^" anchor, so we try at the start of each substring:
  • For "(999) 555-0000 /x ext123" , "(" matches [^0-9] , so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123" , the "999" part doesn't match [^0-9] , nor does it match /x.* , so we keep trying from the ")" , which matches [^0-9] , so we remove it.
  • And so on. When it gets to the "/" , the same thing happens, it matches [^0-9] and is removed, meaning the second part of the regex can never, ever match.

In the first case, what happens is the following:

  • Again, no "^" anchor, so we try for all substrings:
  • For "(999) 555-0000 /x ext123" , "(" does not match /x.* , but it does match [^0-9] , so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123" , the "999" part doesn't match /x.* , nor does it match [^0-9] , so we keep trying from the ")" , which doesn't match /x.* , but which matches [^0-9] , so we remove it.
  • When we hit the "/x" , this time /x.* does match, it matches "/x ext123" , and the rest of the string is removed, leaving us with nothing to continue with.

I think you've got the wrong idea about alternation (ie, the pipe). In a pure DFA regex implementation, it's true that alternation favors the longest match no matter how the alternatives are ordered. In other words, the whole regex, whether it contains alternation or not, always returns the earliest and longest possible match--the "leftmost-longest" rule.

However, the regex implementations in most of today's popular programming languages, including .NET, are what Friedl calls Traditional NFA engines. One of the most important differences between them and DFA engines is that alternation is not greedy; it attempts the alternatives in the order they're listed and stops as soon as one of them matches. The only thing that will cause it to change its mind is if the match fails at a later point in the regex, forcing it to backtrack into the alternation.

Note that if you change the [^0-9] to [^0-9]+ in both regexes you'll get the same result from both--but not the one you want. (I'm assuming the /x.* alternative is supposed to match--and remove--the rest of the string, including the extension number.) I'd suggest something like this:

"[^0-9/]+|/x.*$"

That way, neither alternative can even start to match what the other one matches. Not only will that will prevent the kind of confusion you're experiencing, it avoids potential performance bottlenecks. One of the other major differences between DFA's and NFA's is that badly-written NFA's are prone to serious (even catastrophic ) performance problems, and sloppy alternations are one of the easiest ways to trigger them.

The alternation operator ( | ) attempts expressions in the order provided. In the first example, it tries to match the expression /x.* first, then it tries to match [^0-9] .

Because the string " /x ext" can be matched by the first expression [^0-9] , as in the second example, the second part of the expression, /x.* is never invoked.

Billy3

EDIT: More information here on the alternation operator: http://www.regular-expressions.info/alternation.html

You are missing some parenthesis. :)

The problem in your expression is that /x.* matches greedily. In the first expression it is provided first, so the engine attempts to match it first, resulting in matching all of the rest of the string after /x too because of the .* .

If you change it to /x.*? you get the same result as the second expression. The ? after the * tells the regex engine to match non-greedy.

Check http://www.regular-expressions.info/repeat.html to learn more about greediness.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM