R: Regex in strsplit (finding ", " followed by capital letter)
Say I have a character vector whose elements I want to split based on a regular expression.
To be more precise, I want to split the strings at a comma, followed by a space, followed by a capital letter. To my understanding, the regex looks like this: /(, [A-Z])/g (which works fine when I try it in an online regex tester).
When I try to achieve this in R, the regex doesn't seem to work, for example:
x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
"Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")
strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"
[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"
It finds no split. What am I doing wrong here?
Any help is greatly appreciated!
Here is a solution:
strsplit(x, ", (?=[A-Z])", perl=T)
See the IDEONE demo.
Output:
[[1]]
[1] "Non MMF investment funds"
[2] "Insurance corporations"
[3] "Assets (Net Acquisition of)"
[4] "Loans"
[5] "Long-term original maturity (over 1 year or no stated maturity)"
[[2]]
[1] "Non financial corporations"
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"
[4] "Loans"
[5] "Short-term original maturity (up to 1 year)"
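The regex ", (?=[A-Z])" contains a look-ahead, (?=[A-Z]), that checks for but does not consume the uppercase letter, so the letter stays at the start of the next piece. In R, you need to use perl=T (short for perl=TRUE) with regexps that contain lookarounds, because the default regex engine does not support them. Also note that R takes the pattern as a plain string: the /.../g delimiters and flag from the original attempt are treated as literal characters, which is why no split was found.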
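To make the difference concrete, here is a minimal sketch comparing a consuming pattern with the look-ahead version, plus a check that slashes and the trailing g are ordinary characters in an R pattern. The sample strings are made up for illustration and are not taken from the question's data.

```r
# Hypothetical sample string, not taken from the question's data
s <- "Loans, Long-term, short-term, Assets"

# Consuming pattern: the capital letter is part of the delimiter and is lost
consumed <- strsplit(s, ", [A-Z]")[[1]]

# Look-ahead pattern: only ", " is the delimiter, the capital letter is kept
kept <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

# The slashes and the trailing g are matched literally, so this pattern
# only splits text that really contains "/, B/g"
literal <- strsplit("left/, B/gright", "/(, [A-Z])/g")[[1]]

print(consumed)  # "Loans" "ong-term, short-term" "ssets"
print(kept)      # "Loans" "Long-term, short-term" "Assets"
print(literal)   # "left" "right"
```

Note that ", short-term" is left alone in both cases because the character after the space is lowercase.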
If the space is optional, or there can be a double space between the comma and the uppercase letter, use
strsplit(x, ",\\s*(?=[A-Z])", perl=T)
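As a quick check of the whitespace-tolerant pattern, the sketch below runs it on a made-up string (an assumption, not the question's data) with no space, two spaces, and one space after the commas.

```r
# Hypothetical input with inconsistent spacing after the commas
messy <- "Loans,Assets,  Long-term, short-term"

# \\s* matches zero or more whitespace characters before the capital letter
parts <- strsplit(messy, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(parts)  # "Loans" "Assets" "Long-term, short-term"
```

The comma before "short-term" does not split, since no uppercase letter follows it however much whitespace \\s* gives back.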
And one more variation that supports Unicode uppercase letters (with \\p{Lu}):
strsplit(x, ", (?=\\p{Lu})", perl=T)
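The following sketch shows where the two character classes differ. The sample string is hypothetical, written with \u escapes so that R marks it as UTF-8, and it assumes R's PCRE library was built with Unicode property support, which is the usual case.

```r
# Hypothetical sample: "Zürich, Ärzte, école, Öl" written with \u escapes
u <- "Z\u00fcrich, \u00c4rzte, \u00e9cole, \u00d6l"

# [A-Z] only covers ASCII capitals, so no delimiter qualifies here
ascii_split <- strsplit(u, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter, including \u00c4 and \u00d6
unicode_split <- strsplit(u, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_split))    # 1
print(length(unicode_split))  # 3
```

The lowercase "\u00e9" after the second comma is not an uppercase letter, so that comma stays inside the second piece.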