
Regex to match the value optionally enclosed by double quotes

I have three columns delimited by whitespace, but the second field is optionally enclosed in double quotes.

I want to extract the first field, the second field (the value inside the double quotes), and the third field. Sometimes the second field is not enclosed in double quotes; in that case, just return the value as it is.

Sample Input

1a "2a 2.1a 2.2a" 3a
4b "5.5b 5.6b 5.7b" 6b
7c 8c 9c

Expected output

The expected matches are:
1st row match

\1 1a
\2 2a 2.1a 2.2a
\3 3a

2nd row match

\1 4b
\2 5.5b 5.6b 5.7b
\3 6b

3rd row match

\1 7c
\2 8c
\3 9c

I tried the regex below. It works for the first two lines, but the third line is not matched. Can someone help me solve this?

Regex I tried:

([a-z0-9]+)\s+"([a-z0-9\s.]+)"\s+([a-z0-9]+)

Link:

https://regex101.com/r/rN4uB4/1

You could simply make the quotation marks optional in your pattern. Following a token with ? tells the regular expression engine to match that token zero or one time.

([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)
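To check the pattern above against the sample input, here is a minimal Python sketch. The helper name `split_fields` is made up for illustration, and using `fullmatch` (so the whole line has to match) is an assumption; the question does not say whether the lines should be anchored.

```python
import re

# Pattern from the answer above: each quote character is optional.
PATTERN = re.compile(r'([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)')

def split_fields(line):
    """Return the three captured fields as a tuple, or None if there is no match."""
    m = PATTERN.fullmatch(line)  # assumption: the whole line must match
    return m.groups() if m else None

sample = [
    '1a "2a 2.1a 2.2a" 3a',
    '4b "5.5b 5.6b 5.7b" 6b',
    '7c 8c 9c',
]
for line in sample:
    print(split_fields(line))
```

For the three sample lines, the second element of each tuple is the field value without the surrounding quotes.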

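If your regex flavor supports it (Perl and PCRE do, for example), you could use the branch reset feature. Inside a branch reset group (?|...), the capturing groups in each alternative share the same group numbers, so whichever alternative matches, the second field ends up in group 2.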

([a-z0-9]+)\s+(?|"([^"]+)"|([a-z0-9]+))\s+([a-z0-9]+)

The problem with your regex is that it requires the quotes, whereas in your input the quotes around the value are optional.

You can parse this using:

([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)

The ? means the preceding group (or character, " in this case) is optional.

It does make me wonder what you want to do, however. This looks a lot like bash argument parsing. Sometimes you can take advantage of libraries for this...
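As one illustration of the library route, Python's standard `shlex` module splits a string using shell-like quoting rules. This is only a sketch: the question does not name a language, and the helper name `split_with_shlex` is hypothetical.

```python
import shlex

def split_with_shlex(line):
    """Split a line into fields using shell-like quoting rules.

    Returns None when the quoting is unbalanced, because shlex.split
    raises ValueError for an unclosed quotation.
    """
    try:
        return shlex.split(line)
    except ValueError:
        return None

for line in ['1a "2a 2.1a 2.2a" 3a', '7c 8c 9c', '4b "5.5b 5.6b 5.7b 6b']:
    print(split_with_shlex(line))
```

Note that `shlex` also interprets single quotes and backslash escapes, which may or may not be what you want for this data.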

EDIT

@PetSerAl brings up a valid point: the two optional quotes " are independent of each other, so lines with only one quote, such as:

4b "5.5b 5.6b 5.7b 6b
4b 5.5b 5.6b 5.7b" 6b

will match as well. You can solve this by introducing additional capture groups:

([a-z0-9]+)\s+("([a-z0-9\s.]+)"|([a-z0-9\s.]+))\s+([a-z0-9]+)

In that case the old capture groups map onto the new ones as follows:

  • \1 -> \1
  • \2 -> \3 (with quotes) or \4 (without quotes)
  • \3 -> \5

You can also use the new \2 in place of the old \2, but the new \2 includes the quotes " when they are part of the string, so it takes more postprocessing to handle them correctly.
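The group mapping above can be sketched in Python as follows. The function name `extract` is hypothetical, and anchoring with `fullmatch` is an assumption on top of the answer; an unanchored search may still find a partial match inside a line with a single quote.

```python
import re

# Alternation pattern from the answer: quoted value OR unquoted value.
PATTERN = re.compile(
    r'([a-z0-9]+)\s+("([a-z0-9\s.]+)"|([a-z0-9\s.]+))\s+([a-z0-9]+)'
)

def extract(line):
    """Return (first, second, third) with quotes stripped, or None if no match."""
    m = PATTERN.fullmatch(line)  # assumption: the whole line must match
    if m is None:
        return None
    # Group 3 is set for a quoted value, group 4 for an unquoted one.
    second = m.group(3) if m.group(3) is not None else m.group(4)
    return (m.group(1), second, m.group(5))

for line in ['4b "5.5b 5.6b 5.7b" 6b', '7c 8c 9c', '4b "5.5b 5.6b 5.7b 6b']:
    print(extract(line))
```

Picking whichever of groups 3 and 4 participated in the match avoids having to strip the quotes from group 2 afterwards.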
