
Using strsplit() in R, ignoring anything in parentheses

I'm trying to use strsplit() in R to break a string into pieces based on commas, but I don't want to split up anything in parentheses. I think the answer is a regex, but I'm struggling to get the code right.

So, for example:

x <- "This is it, isn't it (well, yes)"
> strsplit(x, ", ")
[[1]]
[1] "This is it"     "isn't it (well" "yes)" 

What I would like instead is:

[1] "This is it"     "isn't it (well, yes)"

We can use a PCRE regex that FAILs any , that follows a ( but comes before the closing ), and split on , followed by zero or more whitespace characters ( \\s* ):

 strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl=TRUE)[[1]]
 #[1] "This is it"           "isn't it (well, yes)"
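
A minimal sketch of how this pattern could be reused on several strings at once, since strsplit() is vectorized over its first argument. The helper name split_outside_parens() and the second input string are made up for illustration; neither comes from the original answer.

```r
# Hypothetical helper (the name is an assumption, not a base R function).
split_outside_parens <- function(strings) {
  # Alternative 1 consumes "(" up to a comma inside the parentheses,
  # then (*SKIP)(*FAIL) discards that text so the comma is never a delimiter.
  # Alternative 2 is the real delimiter: a comma plus optional whitespace.
  pattern <- "\\([^)]+,(*SKIP)(*FAIL)|,\\s*"
  strsplit(strings, pattern, perl = TRUE)
}

inputs <- c("This is it, isn't it (well, yes)", "a, b,c")
result <- split_outside_parens(inputs)
print(result)
```

The result is a list with one character vector per input string, so result[[1]] holds the pieces of the first string.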

I would suggest another regex with (*SKIP)(*F) that ignores all the (...) substrings and matches only the commas outside of parenthesized substrings:

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl=TRUE)

See the IDEONE demo.

You can read more about this in "How do (*SKIP) or (*F) work on regex?". The regex matches:

  • \\( - an opening bracket
  • [^()]* - zero or more characters other than ( and )
  • \\) - a closing bracket
  • (*SKIP)(*F) - verbs that discard the text matched so far and advance the current regex index to the position after the closing bracket, so the next match attempt starts there
  • | - or...
  • \\h*,\\h* - a comma surrounded by zero or more horizontal whitespace characters.
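
For reference, here is a small sketch that runs the pattern described above on the sample string and prints the pieces. The variable names pattern and parts are only illustrative.

```r
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"

# "(...)" groups are matched and skipped; only commas outside them split.
pattern <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"

parts <- strsplit(x, pattern, perl = TRUE)[[1]]
print(parts)
# Expected pieces:
#   "This is it"
#   "isn't it (well, yes)"
#   "and (well, this, that, and this, too)"
```

Note that [^()]* cannot cross another parenthesis, so as far as I can tell this only protects the innermost pair when parentheses are nested.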

A different approach:

Adding on to @Wiktor's sample string,

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"

Now the magic:

> strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)
[[1]]
[1] "This is it"                                       
[2] "isn't it (well, yes)"                             
[3] "and (well, this, that, and this, too). Let's look"
[4] "does it work?"  

So how does , |(?>\\(.*?\\).*?\\K(, |$)) match?

  • | matches either of the alternatives on its two sides:
    • on the left, the literal string ", " (a comma followed by a space),
    • and on the right, (?>\\(.*?\\).*?\\K(, |$)) :
      • (?> ... ) sets up an atomic group, which does not allow backtracking to reevaluate what it matches.
      • In this case, it looks for an open parenthesis ( \\( ),
      • then any character ( . ) repeated from 0 to infinity times ( * ), but as few times as possible ( ? ), i.e. .* is evaluated lazily.
      • That lazy repetition stops at the first close parenthesis ( \\) ),
      • followed by another run of any character, again repeated as few times as possible ( .*? ),
      • with a \\K at the end, which throws away the match so far and sets the starting point of a new match.
      • The previous .*? is limited by a capturing group ( ( ... ) ) containing an | that either
        • matches the literal delimiter ", ",
        • or matches at the end of the line, $ , if there are no more commas.

*Whew.*

If my explanation is confusing, see the docs linked above, and check out regex101.com, where you can put in the above regex (single escaped, \ , instead of R-style double escaped, \\ ) and a test string to see what it matches and get an explanation of what it's doing. You'll need to set the g (global) modifier in the box next to the regex box to show all matches and not just the first.
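
If you'd rather stay inside R, a rough equivalent of that check is to list the delimiters the regex actually matches with gregexpr() and regmatches(). This is only a sketch: the variable names are illustrative, and gregexpr() scans the whole string in one pass while strsplit() works on the remaining text after each split, so the two can disagree for other patterns (for example, ones with lookbehind).

```r
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"
pattern <- ", |(?>\\(.*?\\).*?\\K(, |$))"

# Find every match; because of \K, only the text after it is reported.
m <- gregexpr(pattern, x, perl = TRUE)

delims <- regmatches(x, m)[[1]]  # the matched delimiters
starts <- as.integer(m[[1]])     # 1-based start position of each match

print(delims)
print(starts)
```

Each reported match should be a ", " that sits outside a parenthesized chunk, which is exactly where strsplit() cuts the string.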

Happy strsplitting!
