R: regex in strsplit (finding ", " followed by a capital letter)

Question

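Suppose I have a character vector whose elements I want to split based on a regular expression.

More precisely, I want to split the strings on a comma, followed by a space, followed by a capital letter (as I understand it, the regex would look like this: /(, [A-Z])/g, which worked fine when I tested it).

When I try to do this in R, the regex does not seem to work, for example: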

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It does not find any place to split. What am I doing wrong here?

Any help is greatly appreciated!

Answer 1

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl=T)

See the IDEONE demo.

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

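The regex ", (?=[A-Z])" contains a lookahead, (?=[A-Z]), which checks for an uppercase letter but does not consume it, so the letter stays at the start of the next piece. In R, you need to pass perl=T (short for perl=TRUE) when the regex contains lookarounds.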
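
To see what the lookahead changes, here is a minimal sketch on a made-up string (x1 is illustrative, not taken from the question). It compares three patterns: the one from the question, where the slashes and the trailing g are JavaScript-style delimiters that R treats as literal characters; a plain ", [A-Z]", which consumes the capital letter; and the lookahead version.

```r
# Illustrative input (an assumption, not from the question)
x1 <- "Loans, Long-term, short, Assets"

# R regex strings have no /.../g delimiters: "/" and "g" are matched
# literally, so nothing matches and the string comes back unsplit.
no_split <- strsplit(x1, "/(, [A-Z])/g")[[1]]

# Without a lookahead the capital letter is part of the match and is dropped.
consumed <- strsplit(x1, ", [A-Z]")[[1]]

# With a lookahead only ", " is removed; the capital letter is kept.
kept <- strsplit(x1, ", (?=[A-Z])", perl = TRUE)[[1]]

print(no_split)
print(consumed)
print(kept)
```

The second call returns pieces such as "ong-term, short", which is the reason for using a lookahead instead of matching the letter directly.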

If the space is optional, or there may be two spaces between the comma and the capital letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl=T)
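
As a rough illustration of the difference, the sketch below runs both patterns on a made-up string (y is an assumption, chosen to contain a comma with no space and a comma with two spaces).

```r
# Illustrative input: no space after the first comma, two spaces after the second
y <- "Loans,Assets,  Funds, other items"

# The strict pattern needs exactly one space before the capital letter,
# so it finds nothing to split on here.
strict <- strsplit(y, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\s* allows zero or more whitespace characters after the comma.
flexible <- strsplit(y, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(strict)
print(flexible)
```

In both cases ", other items" is left alone because the letter after the comma is lowercase.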

There is also a variant that supports Unicode letters (using \\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl=T)
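
A small sketch of when the Unicode variant matters, using a made-up string (z is an assumption) with a non-ASCII capital letter. It assumes the PCRE library behind R was built with Unicode property support, which is the usual case; the \u escapes keep the example independent of the file encoding.

```r
# Illustrative input: "café, École, lait", written with \u escapes
z <- "caf\u00e9, \u00c9cole, lait"

# [A-Z] only covers ASCII capitals, so the accented capital is not recognised.
ascii_only <- strsplit(z, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter, including the accented one.
unicode <- strsplit(z, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(ascii_only)
print(unicode)
```

With [A-Z] the string should come back in one piece, while \\p{Lu} should split it before the accented capital.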


Answer 1: 8, accepted, 2015-11-17 14:44:43
