
Regular expression with optional group wrapping multiple groups returns undefined for branches not taken

I'm trying to write a regular expression in JavaScript that returns the first quoted or non-quoted word in a string, without the quotes (if present). For example:

'"quoted phrase" followed by text' => 'quoted phrase'
'phrase without quotes followed by text' => 'phrase'

My regular expression currently is this: (?:"([^"]*)"|([^"\s]+))

However, I'm noticing that the output always includes two match groups, one of which is always undefined, presumably from the branch that wasn't taken (i.e. it's the first group if the first word is not quoted, and the second otherwise).
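The behavior described above can be reproduced with a short script. This is only a sketch using the regex and sample inputs from the question; the variable names are illustrative.

```javascript
// The question's regex: one capture group per alternative.
const re = /(?:"([^"]*)"|([^"\s]+))/;

const quoted = re.exec('"quoted phrase" followed by text');
const unquoted = re.exec('phrase without quotes followed by text');

// Only the group in the branch that actually matched is filled in.
console.log(quoted[1], quoted[2]);     // quoted phrase undefined
console.log(unquoted[1], unquoted[2]); // undefined phrase
```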

What changes can I make to avoid getting the undefined match group and still get the quote-stripping behavior?

NOTE: The words are NOT strictly "word-only" (e.g. alphanumeric) characters. They can include non-word characters, just not the " character.

You need to use ^ (the start anchor) to match the first word, and you can simply use \w+ to match the word. Also, I think you don't need the outer group:

"([^"]*)"|(^\w+)


You are getting extra matches because of the nested groupings you have defined inside your regular expression. The corrected expression should be (?:"[^"]*"|[^"\s]+), which would produce the following for your inputs (without string quotes):

'"quoted phrase" followed by text' => "quoted phrase"
'phrase without quotes followed by text' => phrase
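As a rough check of the output listed above, the sketch below runs that expression against both inputs. Note that the whole match keeps the surrounding quotes; the `stripQuotes` helper is an assumption added here for illustration and is not part of the answer.

```javascript
// The expression from this answer: no capture groups, so only match[0] is set.
const re = /(?:"[^"]*"|[^"\s]+)/;

// Hypothetical helper: returns the whole first match, or null if there is none.
const firstMatch = (s) => {
  const m = s.match(re);
  return m ? m[0] : null;
};

// Assumption: quotes are removed in a separate step after matching.
const stripQuotes = (s) => s.replace(/^"|"$/g, '');

console.log(firstMatch('"quoted phrase" followed by text'));        // "quoted phrase"
console.log(firstMatch('phrase without quotes followed by text')); // phrase
console.log(stripQuotes(firstMatch('"quoted phrase" followed by text'))); // quoted phrase
```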

You can't do what you want using just the regex. Other regex flavors have power features like the branch reset group (which causes capturing groups in each branch to start with the same number):

(?|"([^"]*)"|([^"\s]+))

...or they let you use the same name for more than one group:

(?:"(?<token>[^"]*)"|(?<token>[^"\s]+))

...but JavaScript has nothing. Of all the regex flavors associated with programming languages (Perl, Python, Java, etc.), JavaScript is the most lacking in useful features. You just have to go through all the groups and find the one that's not undefined.
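Going through the groups and picking the one that is not undefined might look like the following. This is a minimal sketch; the function name `firstToken` is hypothetical, and the regex is the question's own with the redundant outer group dropped.

```javascript
// Hypothetical helper: returns the first quoted phrase or bare word, without quotes.
function firstToken(input) {
  const match = /"([^"]*)"|([^"\s]+)/.exec(input);
  if (match === null) return null;
  // Skip match[0] (the whole match) and take the one group that participated.
  return match.slice(1).find((group) => group !== undefined);
}

console.log(firstToken('"quoted phrase" followed by text'));        // quoted phrase
console.log(firstToken('phrase without quotes followed by text')); // phrase
```

Comparing against undefined rather than testing truthiness matters here: an empty quoted string such as "" captures an empty string, which is falsy but still a valid result.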
