
Regex to match the value optionally enclosed by double quotes

I have 3 columns delimited by whitespace, but the second field is optionally enclosed in double quotes.

I want to extract the 1st field, the 2nd field (the value within the double quotes) and the 3rd field. Sometimes the 2nd field is not enclosed in double quotes; in that case, just return the value as it is.

Sample Input

1a "2a 2.1a 2.2a" 3a
4b "5.5b 5.6b 5.7b" 6b
7c 8c 9c

Final output

Matching information:
1st row match

\1 1a
\2 2a 2.1a 2.2a
\3 3a

2nd row match

\1 4b
\2 5.5b 5.6b 5.7b
\3 6b

3rd row match

\1 7c
\2 8c
\3 9c

I tried the regex below. It works fine for the first two inputs, but the third line is not matched. Can someone help me solve this issue?

Regex I tried:

([a-z0-9]+)\s+"([a-z0-9\s.]+)"\s+([a-z0-9]+)

Link:

https://regex101.com/r/rN4uB4/1

You could simply make the quotation marks optional in your pattern. By following a token with ? you tell the regular expression engine to match that token zero or one time.

([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)
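As a quick check, the pattern above can be run against the three sample rows from the question. This is only a minimal sketch in Python (the question does not name a language, so Python and the names PATTERN, lines and rows are assumptions made for illustration). It uses fullmatch so that each whole line has to match.

```python
import re

# The pattern from the answer, with both quotation marks made optional.
PATTERN = re.compile(r'([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)')

# The three sample rows from the question.
lines = [
    '1a "2a 2.1a 2.2a" 3a',
    '4b "5.5b 5.6b 5.7b" 6b',
    '7c 8c 9c',
]

# fullmatch requires the whole line to match, not just a part of it.
rows = [PATTERN.fullmatch(line).groups() for line in lines]

for row in rows:
    print(row)
# ('1a', '2a 2.1a 2.2a', '3a')
# ('4b', '5.5b 5.6b 5.7b', '6b')
# ('7c', '8c', '9c')
```

On the third row the second group first grabs "8c 9c" and then backtracks to "8c" so that the trailing \s+ and the last group can still match.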

If your language supports it, you could use the branch reset feature. With this feature, the capturing groups in the alternatives share the same group number, so they are treated as one capturing group.

([a-z0-9]+)\s+(?|"([^"]+)"|([a-z0-9]+))\s+([a-z0-9]+)
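Not every engine accepts (?|...). Python's built-in re module, for example, does not. A rough way to get the same effect there is to use an ordinary non-capturing alternation and pick whichever of the two inner groups took part in the match. The sketch below assumes Python, and the helper name split_row is hypothetical.

```python
import re

# Same alternatives as the branch reset pattern, but with a plain
# non-capturing group, so the quoted and bare values get separate numbers.
PATTERN = re.compile(r'([a-z0-9]+)\s+(?:"([^"]+)"|([a-z0-9]+))\s+([a-z0-9]+)')


def split_row(line):
    """Return (first, second, third), or None if the line does not match."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    first, quoted, bare, third = m.groups()
    # Exactly one of quoted / bare is set; the other is None.
    second = quoted if quoted is not None else bare
    return first, second, third


print(split_row('1a "2a 2.1a 2.2a" 3a'))  # ('1a', '2a 2.1a 2.2a', '3a')
print(split_row('7c 8c 9c'))              # ('7c', '8c', '9c')
```

Because the quotes appear only inside the first alternative, a line with just one quote does not match at all.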

The problem with your regex is that it requires the quotes, whereas in your input the quotes around the value are optional.

You can parse this using:

([a-z0-9]+)\s+"?([a-z0-9\s.]+)"?\s+([a-z0-9]+)

The ? means the preceding group (or, in this case, the " character) is optional.

It does make me wonder, however, what you want to do. This looks a lot like bash argument parsing. Sometimes you can take advantage of libraries for this...
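As one possible illustration of the library route, Python's standard shlex module splits a string using shell-like quoting rules. This is a sketch under the assumption that Python is available and that shell-style parsing is acceptable for the data. The helper name split_fields is hypothetical.

```python
import shlex


def split_fields(line):
    """Split a line on whitespace, treating "..." as a single field."""
    return shlex.split(line)


print(split_fields('1a "2a 2.1a 2.2a" 3a'))  # ['1a', '2a 2.1a 2.2a', '3a']
print(split_fields('7c 8c 9c'))              # ['7c', '8c', '9c']

# An unbalanced quote is reported as an error instead of being accepted.
try:
    split_fields('4b "5.5b 5.6b 5.7b 6b')
except ValueError as exc:
    print('rejected:', exc)
```

Note that shlex also interprets single quotes and backslash escapes, which may or may not be wanted for this input.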

EDIT 编辑

@PetSerAl brings up a valid point: the two quotes " are independent of each other, so:

4b "5.5b 5.6b 5.7b 6b
4b 5.5b 5.6b 5.7b" 6b

will match as well. You can solve this by introducing additional capture groups:

([a-z0-9]+)\s+("([a-z0-9\s.]+)"|([a-z0-9\s.]+))\s+([a-z0-9]+)

In that case the old capture groups map onto the new ones as follows:

  • \1 -> \1
  • \2 -> \3 (with quotes) or \4 (without quotes)
  • \3 -> \5

One can also use the new \2 in place of the old \2, but the new \2 will include the quotes " if they are part of the string.

It will thus take more postprocessing to handle them correctly.
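The group mapping can be sketched in Python as follows. This is only an illustration under the assumption that Python's re module is used, and the helper name parse_row is hypothetical. It reads group 3 when the value was quoted and group 4 otherwise, and it uses fullmatch so that a line with a single, unbalanced quote is rejected.

```python
import re

# The five-group pattern: group 2 is the raw second field (quotes included),
# group 3 is the quoted value, group 4 is the unquoted value.
PATTERN = re.compile(
    r'([a-z0-9]+)\s+("([a-z0-9\s.]+)"|([a-z0-9\s.]+))\s+([a-z0-9]+)'
)


def parse_row(line):
    """Return the three logical fields, or None if the line does not match."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    first, _raw, quoted, bare, third = m.groups()
    second = quoted if quoted is not None else bare
    return first, second, third


print(parse_row('4b "5.5b 5.6b 5.7b" 6b'))  # ('4b', '5.5b 5.6b 5.7b', '6b')
print(parse_row('7c 8c 9c'))                # ('7c', '8c', '9c')
print(parse_row('4b "5.5b 5.6b 5.7b 6b'))   # None
print(parse_row('4b 5.5b 5.6b 5.7b" 6b'))   # None
```

Reading group 2 instead would return "5.5b 5.6b 5.7b" with the quotes still attached, which is the extra postprocessing mentioned above.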
