Using strsplit() in R, ignoring anything in parentheses

Question

I'm trying to use strsplit() in R to break a string into pieces based on commas, but I don't want to split up anything in parentheses. I think the answer is a regex, but I'm struggling to get the code right.

So for example:

> x <- "This is it, isn't it (well, yes)"
> strsplit(x, ", ")
[[1]]
[1] "This is it"     "isn't it (well" "yes)"

When what I would like is:

[1] "This is it"     "isn't it (well, yes)"

Answer 1

We can use a PCRE regex to FAIL any , that comes after a ( and before the ) , and split on a , followed by zero or more whitespace characters ( \\s* ):

 strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl=TRUE)[[1]]
 #[1] "This is it"           "isn't it (well, yes)"

Answer 2

I would suggest another regex with (*SKIP)(*F) to ignore all the (...) substrings and only match the commas outside of parenthesized substrings:

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl = TRUE)

See the IDEONE demo.

You can read more in "How do (*SKIP) or (*F) work on regex?". The regex matches:

\\( - an opening parenthesis
[^()]* - zero or more characters other than ( and )
\\) - a closing parenthesis
(*SKIP)(*F) - the verbs that discard the match and advance the current regex index to the position after the closing parenthesis
| - or...
\\h*,\\h* - a comma surrounded by zero or more horizontal whitespace characters.
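The answer shows the call but not its result. As a minimal sketch, the same pattern can be run on the sample string and the pieces inspected; the variable names pat and parts are only illustrative and do not come from the answer.

```r
# Sample string from the answer above
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"

# Pattern from the answer: skip whole (...) groups, then split on commas
# with optional horizontal whitespace around them
pat <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"

# perl = TRUE is required: (*SKIP), (*F) and \h are PCRE features
parts <- strsplit(x, pat, perl = TRUE)[[1]]
print(parts)
# parts should hold three elements:
#   "This is it"
#   "isn't it (well, yes)"
#   "and (well, this, that, and this, too)"
```

Because [^()]* excludes both kinds of parenthesis, this handles flat (...) groups; nested parentheses would need a different pattern.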

Answer 3

A different approach:

Adding on to @Wiktor's sample string,

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"

Now the magic:

> strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)
[[1]]
[1] "This is it"                                       
[2] "isn't it (well, yes)"                             
[3] "and (well, this, that, and this, too). Let's look"
[4] "does it work?"

So how does , |(?>\\(.*?\\).*?\\K(, |$)) match?

| matches either of the alternatives on its two sides:
- on the left, the literal string ", " (a comma followed by a space),
- and on the right, (?>\\(.*?\\).*?\\K(, |$)) :
  - (?> ... ) sets up an atomic group, which does not allow backtracking to reevaluate what it matched.
  - In this case, it looks for an open parenthesis ( \\( ),
  - then any character ( . ) repeated from 0 to infinity times ( * ), but as few as possible ( ? ), i.e. . is evaluated lazily.
  - The previous . repetition is then limited by the first close parenthesis ( \\) ),
  - followed by another run of any character, again repeated as few times as possible ( .*? ),
  - with a \\K at the end, which throws away the match so far and sets the starting point of a new match.
  - The previous .*? is limited by a capturing group ( ( ... ) ) with an | that either
    - selects an actual text string, ", ",
    - or matches the end of the string, $ , if there are no more commas.

*Whew.*

If my explanation is confusing, see the docs linked above, and check out regex101.com, where you can put in the above regex (single escaped, \ , instead of R-style double escaped, \\ ) and a test string to see what it matches and get an explanation of what it's doing. You'll need to set the g (global) modifier in the box next to the regex box to show all matches and not just the first.
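Two pieces of the explanation, lazy repetition and \\K, can also be tried in isolation. The following is a small sketch using sub(); the test strings and variable names are made up for illustration and are not part of the original answer.

```r
s <- "a (b) c (d)"

# Lazy .*? stops at the first closing parenthesis
lazy <- sub("\\(.*?\\)", "", s, perl = TRUE)
print(lazy)    # "a  c (d)"

# Greedy .* runs on to the last closing parenthesis
greedy <- sub("\\(.*\\)", "", s, perl = TRUE)
print(greedy)  # "a "

# \K drops "foo" from the reported match, so only "bar" is replaced
kept <- sub("foo\\Kbar", "X", "foobar", perl = TRUE)
print(kept)    # "fooX"
```

In the strsplit() call, the same \\K effect is what keeps the parenthesized text in the output: only the ", " after \\K is treated as the separator.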

Happy strsplitting!

3 answers

Answer 1
15 2016-02-11 18:52:19

Answer 2
6 2016-02-11 19:18:34

Answer 3
1 2016-02-11 22:31:27
