
Regular expression to replace a string

I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:

Regex.Replace(query, @"""[^""~]+""([^~]|$)", 
    m => string.Format(field + "_exact:{0}", m.Value))

What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.

The @ prefix makes the string a verbatim literal, so every " inside it has to be escaped by doubling it: "" . Without the @ you would escape the " as \" instead, but I consider it better to always use @ for regexes, because \ is used so often in patterns, and having to write each one as \\ is tedious and unreadable.
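To make the two escaping styles concrete, here is a small sketch that writes the same pattern once as a verbatim literal and once as a regular literal and checks that they are the same string. The variable names are only illustrative.

```csharp
using System;

// Verbatim literal: backslashes are literal, a double quote is written as "".
string verbatim = @"""[^""~]+""([^~]|$)";

// Regular literal: a double quote is written as \".
string regular = "\"[^\"~]+\"([^~]|$)";

Console.WriteLine(verbatim);            // "[^"~]+"([^~]|$)
Console.WriteLine(verbatim == regular); // True
```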

Let's see what the regex really is:

Console.WriteLine(@"""[^""~]+""([^~]|$)");

is

"[^"~]+"([^~]|$)

So now we can look at the "real" regex.

It looks for a " followed by one or more characters that are neither " nor ~ , followed by another " , followed by either one character that is not ~ or the end of the string. Note that the match can start after the start of the string, and it can end before the end of the string (when the last matched character is a non-~ ).

For example in

car"hello"help

it would match "hello"h
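This can be checked directly. The following is a minimal sketch using Regex.Match on the example input; it shows that the match starts at the first double quote and includes the character after the closing quote.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"""[^""~]+""([^~]|$)";

// Input is: car"hello"help
Match m = Regex.Match("car\"hello\"help", pattern);

Console.WriteLine(m.Success); // True
Console.WriteLine(m.Index);   // 3
Console.WriteLine(m.Value);   // "hello"h
```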

As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:

  "[^"~]+"([^~]|$) 

You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/

1.) a single character

"

The first pattern is a literal character. Since there is no anchor fixing its position, it can occur anywhere in the input.

2.) a character class

[^"~]

The next expression is the [] bracket pair, which is a character class (character set). It defines a set of characters, any one of which may come next, and it stands for exactly one character. So let's look inside to see what is allowed:

^"~

The definition of the character class begins with a caret (^), which is a special character in this position. A caret directly after the opening square bracket negates the class, so it is turned "upside down": any character that is not listed in the class matches.

In this case every character is allowed except the two excluded ones: " and ~.

3.) a special character

+

The next expression, a plus, tells the engine to match the preceding token one or more times. So the character class has to match at least once and may repeat as many times as possible.

4.) a single character

"

To match, the input must then contain a second double quote, which is the closing quote for the one in 1.), since the character class in 2.) repeated by 3.) does not allow a double quote in between.
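Steps 1.) to 4.) together form the quoted-phrase part of the pattern. As a rough sketch, that part can be tried on its own; the sample inputs below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Only the first four parts of the pattern: "[^"~]+"
string quoted = @"""[^""~]+""";

// A quoted phrase with at least one character inside matches.
bool phrase = Regex.IsMatch("say \"hi\" now", quoted);

// An empty pair of quotes does not match: + needs at least one character.
bool empty = Regex.IsMatch("empty \"\" pair", quoted);

// A ~ between the quotes does not match: the class excludes it.
bool tilde = Regex.IsMatch("\"a~b\"", quoted);

// The match covers the opening quote, the content and the closing quote.
string value = Regex.Match("x \"one two\" y", quoted).Value;

Console.WriteLine(phrase); // True
Console.WriteLine(empty);  // False
Console.WriteLine(tilde);  // False
Console.WriteLine(value);  // "one two"
```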

5.) a group

([^~]|$)

The first structure to examine here is the () bracket pair. This is a capturing group. It should not be confused with a lookaround such as (?=...) or (?!...), which only checks a position without extending the match. A plain group consumes whatever it matches, so the character it matches becomes part of the overall match (this is why car"hello"help yields "hello"h and not just "hello").

The group contains two alternatives, joined by a logical OR through the pipe symbol: | So the quoted phrase must be followed either by [^~], one single character of any kind except ~, or by $, the end of the string (or the end of a line, if the regex engine runs in multiline mode).
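To see what this last part means for the Replace call from the question, here is a hedged sketch. The field name "title" and the helper Rewrite are assumptions made up for the example; the question does not show them. The final lines use a negative lookahead (?!~) purely for contrast; it is not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

string field = "title"; // hypothetical field name, not from the question
string pattern = @"""[^""~]+""([^~]|$)";

// Hypothetical wrapper around the Replace call from the question.
string Rewrite(string query) =>
    Regex.Replace(query, pattern,
        m => string.Format(field + "_exact:{0}", m.Value));

Console.WriteLine(Rewrite("\"hello world\"")); // title_exact:"hello world"
Console.WriteLine(Rewrite("\"hello\" car"));   // title_exact:"hello" car
Console.WriteLine(Rewrite("\"hello\"~2"));     // "hello"~2 (no match, unchanged)

// The group consumes the character after the closing quote and captures it.
Match grouped = Regex.Match("car\"hello\"help", pattern);
Console.WriteLine(grouped.Groups[1].Value);    // h

// For contrast: a lookahead checks the next character without consuming it.
Match peeked = Regex.Match("car\"hello\"help", @"""[^""~]+""(?!~)");
Console.WriteLine(peeked.Value);               // "hello"
```

So the replacement prefixes quoted phrases that are not immediately followed by a ~ and leaves the others untouched.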

I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)

Update: to detect an asterisk (star) at the beginning or end of the line, you have to do the following:

First, it's a special character, so you have to escape it with a backslash: \*

To define the position, you can use:

  • ^ to look at the beginning of the line,
  • $ to look at the end of the line

The overall expression would be:

^\* in front of the expression to search for an * at the beginning of the line, and \*$ at the end of the regex to demand an * at the end.

In your case you can add the * as a further alternative in the last group to detect an * at the end:

([^~]|$|\*$)

and to force an * at the end, delete the other alternatives:

(\*$)
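As a small sketch of the asterisk check: the star is escaped as \* and combined with the ^ and $ anchors. The sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// ^\* : a literal asterisk at the very beginning.
bool startsWithStar = Regex.IsMatch("*abc", @"^\*");
bool noStarAtStart  = Regex.IsMatch("abc*", @"^\*");

// \*$ : a literal asterisk at the very end.
bool endsWithStar = Regex.IsMatch("abc*", @"\*$");
bool noStarAtEnd  = Regex.IsMatch("*abc", @"\*$");

Console.WriteLine(startsWithStar); // True
Console.WriteLine(noStarAtStart);  // False
Console.WriteLine(endsWithStar);   // True
Console.WriteLine(noStarAtEnd);    // False
```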
