Using strsplit() in R, ignoring anything in parentheses
I'm trying to use strsplit() in R to break a string into pieces based on commas, but I don't want to split up anything in parentheses. I think the answer is a regex, but I'm struggling to get the code right.
So, for example:
x <- "This is it, isn't it (well, yes)"
> strsplit(x, ", ")
[[1]]
[1] "This is it" "isn't it (well" "yes)"
What I would like is:
[1] "This is it" "isn't it (well, yes)"
We can use a PCRE regex to FAIL any , that follows a ( and comes before the ), and split on a , followed by 0 or more whitespace characters ( \\s* ):
strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl=TRUE)[[1]]
#[1] "This is it" "isn't it (well, yes)"
I would suggest another regex with (*SKIP)(*F) to ignore all the (...) substrings and only match the commas outside of parenthesized substrings:
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl=TRUE)
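The call above is shown without its result. As a minimal sketch, wrapping the same pattern in a helper (the name split_outside_parens is hypothetical, not part of the answer) and running it on the sample string should give three pieces. It assumes the parentheses are balanced and not nested.

```r
# Hypothetical helper: split on commas that are not inside (...).
# Assumes parentheses are balanced and not nested.
split_outside_parens <- function(s) {
  # Skip whole (...) groups, then match a comma with optional horizontal whitespace
  pattern <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"
  strsplit(s, pattern, perl = TRUE)[[1]]
}

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
pieces <- split_outside_parens(x)
print(pieces)
# [1] "This is it"
# [2] "isn't it (well, yes)"
# [3] "and (well, this, that, and this, too)"
```

Note that perl = TRUE is required here, since (*SKIP)(*F) and \\h are PCRE features.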
See the IDEONE demo.
You can read more in "How do (*SKIP) or (*F) work on regex?". The regex matches:
\\( - an opening bracket
[^()]* - zero or more characters other than ( and )
\\) - a closing bracket
(*SKIP)(*F) - the verbs that advance the current regex index to the position after the closing bracket
| - or...
\\h*,\\h* - a comma surrounded by zero or more horizontal whitespace characters.
A different approach:
Adding on to @Wiktor's sample string,
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"
Now the magic:
> strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)
[[1]]
[1] "This is it"
[2] "isn't it (well, yes)"
[3] "and (well, this, that, and this, too). Let's look"
[4] "does it work?"
So how does , |(?>\\(.*?\\).*?\\K(, |$)) match?
| matches either of the alternatives on its two sides:
- ", " (a comma followed by a space), and
- (?>\\(.*?\\).*?\\K(, |$)):
  - (?> ... ) sets up an atomic group, which does not allow backtracking to reevaluate what it matches.
  - Inside it, the pattern looks for an open parenthesis ( \\( ),
  - then any character ( . ) repeated from 0 to infinity times ( * ), but as few as possible ( ? ), i.e. . is evaluated lazily.
  - That . repetition is then limited by the first close parenthesis ( \\) ),
  - followed by another lazy run of any characters ( .*? ),
  - then a \\K, which throws away the match so far and sets the starting point of a new match.
  - The second .*? is limited by a capturing group ( ( ... ) ) with an | that either
    - matches a literal ", " (everything before it having been thrown away by \\K), or
    - matches the end of the line, $, if there are no more commas.
*Whew.*
If my explanation is confusing, see the docs linked above, and check out regex101.com, where you can put in the above regex (single escaped, \, instead of R-style double escaped, \\) and a test string to see what it matches and get an explanation of what it's doing. You'll need to set the g (global) modifier in the box next to the regex box to show all matches and not just the first.
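One way to get the single-escaped form without editing by hand is to let R print the string's actual characters. This is a minimal sketch; the raw-string line assumes R 4.0.0 or later.

```r
# The pattern as written in R source (double escaped)
pattern <- ", |(?>\\(.*?\\).*?\\K(, |$))"

# writeLines() prints the characters the regex engine actually receives,
# which is the single-escaped form to paste into regex101.com
writeLines(pattern)
# , |(?>\(.*?\).*?\K(, |$))

# Assumption: R >= 4.0.0. A raw string lets you type the single-escaped
# form directly; it is the same string as `pattern`.
raw_pattern <- r"[, |(?>\(.*?\).*?\K(, |$))]"
identical(raw_pattern, pattern)
# [1] TRUE
```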
Happy strsplitting!