regex not matching when using ? if first character not present

Here is my C# regex:

\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?

I am testing here with sample string:

{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}

This is what I think the regex does:

PART 1: Look for the pattern of " ANYTHING ": and store ANYTHING (without the quotes).

PART 2: Then look for a : and store everything until you reach a stop character of either " or , or }

It extracts part 1 fine, but doesn't pick up part 2 at all when the " isn't present (i.e. when part 2 isn't a string). So I have two questions:

  • Why isn't my current code picking up part 2? (And how can I fix it?)
  • Is there a way to make the ANYTHING match more flexible? (I tried using \S but it was too greedy.)

First off, don't write your own JSON parser. Use one written by professionals. You're reinventing a rather complex wheel here.
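As a minimal sketch of that advice, here is the sample string from the question run through a standard JSON parser. It uses Python's built-in json module purely for illustration; in C# you would reach for whichever JSON library your project already uses.

```python
import json

# Sample string from the question.
text = '{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}'

data = json.loads(text)

# Each value arrives with its proper type (str, int, bool),
# which a capture group full of text would not give you.
for name, value in data.items():
    print(name, repr(value), type(value).__name__)
```

No regular expression is involved, and whitespace, quoting and value types are all handled by the parser.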

That said, there are also lessons you could learn here about how to write, understand and debug regular expressions, so let's look at that.

Why isn't my current code picking up part 2? (and how can I fix it)

Learn to reason like the regular expression engine.

Let's take a simpler case. We'll take the expression

\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?

And we will search this string:

{"A": "B"}

for an instance of the regular expression.

OK.

  • The { doesn't match anything, so skip it.
  • The first " matches \", so maybe we have a match.
  • A matches ([a-zA-Z0-9]*), so again, maybe we have a match.
  • The second " matches the second \", so we're still good.
  • The : matches : ...
  • We are now trying to match \"?, zero or one quotes. The next character is a space, so we match zero quotes.
  • We are now trying to match ([a-zA-Z0-9]*), any number of alphanumerics. We still have a space, so we match zero alphanumerics.
  • We are now trying to match \"? again, and again we have a space, so we match zero quotes.
  • We are now trying to match ,? and we have zero of them.
  • We are now trying to match }? and again we have zero of them.
  • And we're done. We've successfully matched the pattern, and the match is "A": .
  • Now keep on going; can we match anything in the rest of the string? No. The pattern requires a :, and there is no : in the rest of the string, so I won't labour the point; plainly the match will fail.
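The trace above can be checked mechanically. The sketch below uses Python's re module rather than .NET's Regex; for this particular pattern the two engines behave the same way, and the pattern is copied character for character from the question.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?')

matches = list(pattern.finditer('{"A": "B"}'))

for m in matches:
    # group(0) is the whole match; groups 1 and 2 are the two captures.
    print(repr(m.group(0)), repr(m.group(1)), repr(m.group(2)))
# prints: '"A":' 'A' ''
```

There is exactly one match, and the second group is empty because every optional piece after the colon matched zero characters.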

If that's not the pattern you wanted to match, then write a different pattern. For example, if you want to allow arbitrary whitespace before and after the colon, you probably need a \s* before and after the colon. Also, if you require a value after the :, then why did you make everything after the colon optional? "Required" and "optional" are opposites.
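As a rough illustration of the whitespace point only, here is the question's pattern with \s* added on both sides of the colon and nothing else changed. It is again run in Python's re module, which is an assumption of convenience; the value part is still entirely optional, so this is a demonstration and not a recommended final pattern.

```python
import re

# Question's pattern plus \s* before and after the colon.
pattern = re.compile(r'\"([a-zA-Z0-9]*)\"\s*:\s*\"?([a-zA-Z0-9]*)\"?,?}?')

simple = pattern.findall('{"A": "B"}')
print(simple)  # [('A', 'B')]

sample = '{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}'
pairs = pattern.findall(sample)
print(pairs)
```

The second group is now filled in for the number and the boolean as well, but note that the captured text alone no longer tells you whether a value was a string or not.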


So what's the right thing to do here? Again, the right thing to do is to stop trying to solve this problem with regular expressions and use a JSON parser like a sensible person. But suppose we did want to parse this with regular expressions. How do we do it?

We do it by breaking the problem down into smaller parts.

What do we really want to match? Let's name each thing we want to match and then write a colon, and then say what the structure of that thing is:

DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE

OK, break it down. What's a name?

NAME : QUOTE NAMECONTENTS QUOTE

Keep breaking it down.

NAMECONTENTS : any alphanumeric text of any length

Ask yourself: is that true? Is "" a NAME? Is "1234" a NAME? Is "$" a NAME? Refine the pattern until you get it right. We'll go with this for now.

Now here is a hard one:

VALUE : BOOLEAN_LITERAL
VALUE : NUMBER_LITERAL
VALUE : STRING_LITERAL

This can be any of three things. So again, keep breaking it down:

BOOLEAN_LITERAL : true
BOOLEAN_LITERAL : false

Keep going; you can see how to do it from here.

Now make a regular expression for each part and start putting it back together.

  • The regular expression for NAMECONTENTS is \w* .
  • The regular expression for QUOTE is \" .
  • Therefore the regular expression for NAME is \"\w*\" .
  • We want to capture the name text, so put it in a group: \"(\w*)\"
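One way to keep the pieces honest is to build the pattern out of named strings, one per grammar rule. This is a small sketch in Python's re module; the variable names simply mirror the rule names above.

```python
import re

QUOTE = r'\"'
NAMECONTENTS = r'\w*'

# NAME : QUOTE NAMECONTENTS QUOTE, with the contents captured in a group.
NAME = QUOTE + '(' + NAMECONTENTS + ')' + QUOTE

m = re.search(NAME, '{"SurveyAdmin": false}')
print(NAME)        # \"(\w*)\"
print(m.group(1))  # SurveyAdmin
```

Each named piece can be tested on its own before it is combined with the others.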

Great. Similarly:

  • The regular expression for OPTIONAL_WHITESPACE is \s* .
  • The regular expression for COLON is : .
  • So our regular expression begins \"(\w*)\"\s*:\s*

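Now we need to handle VALUE. But we've broken it down. What is the regular expression for BOOLEAN_LITERAL? That's (true|false). Note the parentheses: square brackets would make a character class that matches a single character, not one of the two words.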
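Square brackets and parentheses are easy to mix up here: [true|false] is a character class, while (true|false) is an alternation. A quick sketch of the difference, using Python's re module for illustration (the same distinction holds in .NET):

```python
import re

# Character class: exactly ONE character out of t, r, u, e, |, f, a, l, s.
char_class = re.compile(r'[true|false]')

# Alternation: the whole word "true" or the whole word "false".
alternation = re.compile(r'(true|false)')

print(char_class.fullmatch('true'))           # None, the class consumes one character
print(char_class.fullmatch('|') is not None)  # True, | is a literal inside brackets
print(alternation.fullmatch('false').group(1))  # false
```

For BOOLEAN_LITERAL it is the alternation that expresses "true or false".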

Keep going; make a regular expression for the other literals and then build up your regular expression from the leaves to the root.
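Putting the leaves-to-root idea together might look like the sketch below, written in Python's re module for illustration. NUMBER_LITERAL and STRING_LITERAL are my own simplified assumptions (no exponents, no escaped quotes inside strings), not definitions from the answer, and they fall well short of real JSON.

```python
import re

# Leaves.
QUOTE = r'\"'
NAMECONTENTS = r'\w*'
OPTIONAL_WHITESPACE = r'\s*'
COLON = r':'
BOOLEAN_LITERAL = r'true|false'
NUMBER_LITERAL = r'-?\d+(?:\.\d+)?'           # assumption: plain decimals only
STRING_LITERAL = QUOTE + r'[^\"]*' + QUOTE    # assumption: no escaped quotes

# Build upward, one grammar rule at a time.
NAME = QUOTE + '(' + NAMECONTENTS + ')' + QUOTE
VALUE = '(' + '|'.join([BOOLEAN_LITERAL, NUMBER_LITERAL, STRING_LITERAL]) + ')'
DESIRED = NAME + OPTIONAL_WHITESPACE + COLON + OPTIONAL_WHITESPACE + VALUE

text = '{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}'
pairs = re.findall(DESIRED, text)
print(pairs)
# [('RestrictedCompany', '""'), ('SQLServerIndex', '0'), ('SurveyAdmin', 'false')]
```

Because VALUE is required here, a bare "A": with nothing after it no longer counts as a match. String values are captured with their quotes, which keeps them distinguishable from numbers and booleans.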
