Regex, R and commas

Question

I'm having some trouble with a regex string in R. I'm trying to use a regex to extract tags from a string (scraped from the web) that looks like this:

str <- "\n\n\n    \n\n\n      “Don't cry because it's over, smile because it happened.”\n    ―\n    Dr. Seuss\n\n\n\n\n   \n     tags:\n       attributed-no-source,\n       cry,\n       crying,\n       experience,\n       happiness,\n       joy,\n       life,\n       misattributed-dr-seuss,\n       optimism,\n       sadness,\n       smile,\n       smiling\n   \n   \n     176513 likes\n   \n\n\n\n\nLike\n\n"

# Why doesn't this work at all?
stringr::str_match(str, "tags:(.+)\\d")

     [,1] [,2]
[1,] NA   NA  

# Why just the first tag? What happens at the comma?
stringr::str_match(str, "tags:\n(.+)")

      [,1]                                  [,2]                          
[1,] "tags:\n       attributed-no-source," "       attributed-no-source,"

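So, two questions: why doesn't my first idea work, and why does the second one stop at the first comma instead of capturing through to the end of the string?

Thanks!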

Answer 1

Note that stringr uses the ICU regex flavor. Unlike in TRE, . does not match line break characters in ICU regex patterns.
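A minimal sketch of that difference, using a made-up two-line string x rather than the scraped text. Base R's default engine (TRE) lets . cross a line break, while PCRE (perl = TRUE) and ICU (stringr) do not. This also accounts for the second attempt in the question: (.+) ends at the end of the line, and that line just happens to end with a comma. The stringr part is skipped when the package is not installed.

```r
# Hypothetical two-line input, not the string from the question
x <- "tags:\n  cry,\n  joy"

# Base R default engine (TRE): "." also matches "\n"
tre_hit <- grepl("cry.+joy", x)

# PCRE engine: "." stops at "\n" unless (?s) is set
pcre_hit <- grepl("cry.+joy", x, perl = TRUE)

print(c(tre = tre_hit, pcre = pcre_hit))

# ICU engine via stringr: same default as PCRE
if (requireNamespace("stringr", quietly = TRUE)) {
  icu_hit <- stringr::str_detect(x, "cry.+joy")
  # The capture ends at the line break, right after the first comma
  one_line <- stringr::str_match(x, "tags:\n(.+)")[, 2]
  print(icu_hit)
  print(one_line)
}
```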

So, a possible fix is to put (?s) at the start of the pattern. It is the DOTALL modifier, which makes . match any character, including line breaks:

str_match(str, "(?s)tags:(.+)\\d")

and

str_match(str, "(?s)tags:\n(.+)")
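As a rough illustration on a shortened, made-up string x: with (?s) the greedy .+ runs to the end of the input and then backtracks to the last digit, so the capture also swallows the text after the tag list. Passing regex(dotall = TRUE) is another way to set the same flag. The sketch assumes stringr is installed.

```r
library(stringr)

# Hypothetical shortened input with the same layout as the scraped text
x <- "tags:\n  cry,\n  joy\n\n  12 likes"

# Inline modifier
m_inline <- str_match(x, "(?s)tags:(.+)\\d")

# Same flag set through stringr's regex() helper
m_flag <- str_match(x, regex("tags:(.+)\\d", dotall = TRUE))

print(identical(m_inline, m_flag))

# Group 1 runs up to, but not including, the last digit in the input
print(m_inline[, 2])
```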

However, it looks like you need to get all the strings below tags: as separate matches. I suggest using base R regmatches / gregexpr with a PCRE regex:

(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+

See the regex demo on this data.

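(?:\\G(?!\\A),?|tags:) - matches either the end of the previous successful match followed by an optional comma (\\G(?!\\A),?), or (|) the substring tags:
\\R - a line break sequence
\\h* - zero or more horizontal whitespace characters
\\K - the match reset operator, which discards all text matched so far
[^\\s,]+ - one or more characters other than whitespace and ,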
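To see what the \\G(?!\\A) branch contributes, here is a small sketch on a made-up string x. Without that branch only the tag right after tags: can match. With it, each match may start where the previous one ended, so the matches chain down the list until a blank line breaks the chain.

```r
# Hypothetical shortened input; the blank line after "life" ends the chain
x <- "tags:\n  cry,\n  joy,\n  life\n\n12 likes"

# Without \G: a match must begin at the literal "tags:"
first_only <- regmatches(x, gregexpr("tags:\\R\\h*\\K[^\\s,]+", x, perl = TRUE))[[1]]

# With \G: a match may also begin at the end of the previous match
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"
chained <- regmatches(x, gregexpr(reg, x, perl = TRUE))[[1]]

print(first_only)
print(chained)
```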

See the R demo:

str <- "\n\n\n    \n\n\n      “Don't cry because it's over, smile because it happened.”\n    ―\n    Dr. Seuss\n\n\n\n\n   \n     tags:\n       attributed-no-source,\n       cry,\n       crying,\n       experience,\n       happiness,\n       joy,\n       life,\n       misattributed-dr-seuss,\n       optimism,\n       sadness,\n       smile,\n       smiling\n   \n   \n     176513 likes\n   \n\n\n\n\nLike\n\n"
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE))
unlist(vals)

Result:

[1] "attributed-no-source" "cry" "crying" 
[4] "experience" "happiness" "joy" 
[7] "life" "misattributed-dr-seuss" "optimism" 
[10] "sadness" "smile" "smiling"
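If the \\G pattern feels too dense, a two-step variant may be easier to maintain. This is a different technique from the one above, sketched under the assumption that the tag list always ends at a whitespace-only line: first grab the block after tags:, then split it on commas. The string x is a made-up, shortened input.

```r
# Hypothetical shortened input with the same layout as the scraped text
x <- "quote\n\n  tags:\n    cry,\n    joy,\n    life\n  \n  12 likes\n"

# Step 1: text after "tags:" up to the first whitespace-only line
block <- regmatches(x, regexpr("(?s)tags:\\s*\\K.*?(?=\\n\\s*\\n)", x, perl = TRUE))

# Step 2: split on commas and strip the surrounding whitespace
tags <- trimws(strsplit(block, ",", fixed = TRUE)[[1]])

print(tags)
```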

Answer 1: 3, accepted, 2017-06-29 18:20:01