R: regular expression in strsplit (finding ", " followed by a capital letter)

Question

Say I have a character vector whose elements I want to split based on a regular expression.

To be more precise, I want to split the strings at a comma followed by a space and then a capital letter. To my understanding, the regex looks like this: /(, [A-Z])/g (which works fine when I try it in an online regex tester).

When I try to achieve this in R, the regex doesn't seem to work, for example:

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It finds no split. What am I doing wrong here?

Any help is greatly appreciated!

Answer 1

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl = TRUE)

See the IDEONE demo.

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

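The regex ", (?=[A-Z])" contains a look-ahead, (?=[A-Z]), that checks for the uppercase letter but does not consume it, so the letter stays in the resulting piece. In R, you need to use perl=TRUE with regexps that contain lookarounds. Also note that R takes the pattern as a plain string: the / delimiters and the g flag of /(, [A-Z])/g are treated as literal characters to match, which is why the original call finds no split.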
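To make the difference concrete, here is a minimal sketch on a short made-up string (the vector `s` and the variable names are illustrative, not part of the question's data). It compares the pattern with the tester-style delimiters, a pattern that consumes the capital letter, and the look-ahead version.

```r
# Illustrative input, not taken from the question's data
s <- "Loans, Assets, Short-term"

# The "/" and "g" are matched literally, so nothing is found
has_slash_match <- grepl("/(, [A-Z])/g", s)

# Without a look-ahead, the capital letter is part of the separator and is lost
consumed <- strsplit(s, ", [A-Z]")[[1]]

# With a look-ahead (needs perl = TRUE), the capital letter is kept
kept <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

# The default regex engine rejects look-arounds, so this call raises an error
default_engine <- tryCatch(
  strsplit(s, ", (?=[A-Z])"),
  error = function(e) "error"
)

print(has_slash_match)
print(consumed)
print(kept)
```

Here `consumed` loses the first letter of every piece after the first, while `kept` preserves the pieces intact.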

If the space is optional, or there can be a double space between the comma and the uppercase letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl = TRUE)
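As a quick check of the \\s* variant, the sketch below uses a made-up string (`y` is hypothetical) with no space, a double space, and a comma followed by a lowercase letter.

```r
# Illustrative input: no space, two spaces, and a lowercase continuation
y <- "Loans,Assets,  Short-term, long-term"

# The strict pattern requires exactly one space before the capital letter
strict <- strsplit(y, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\s* allows zero or more whitespace characters after the comma
flexible <- strsplit(y, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(strict)
print(flexible)
```

The strict pattern leaves `y` in one piece, while the flexible one splits before "Assets" and "Short-term" but not before the lowercase "long-term".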

And one more variation that supports Unicode uppercase letters (with \\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl = TRUE)
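A small sketch of why \\p{Lu} can matter, using a made-up string with non-ASCII capitals (`z` is hypothetical and written with \u escapes so that R marks it as UTF-8). It assumes R's PCRE build has Unicode property support, which is the usual case.

```r
# Illustrative input: "Zürich, Österreich, Ägypten, Bern" written with escapes
z <- "Z\u00fcrich, \u00d6sterreich, \u00c4gypten, Bern"

# [A-Z] only covers ASCII capitals, so only the split before "Bern" is found
ascii_only <- strsplit(z, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter, including the accented ones
unicode_aware <- strsplit(z, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_only))
print(length(unicode_aware))
```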
