Interpreting the regular expressions in sub and gsub in R

Question

I am having trouble understanding what the regular expressions in the following lines of code mean.

author = "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# after some experiment, it looks like this line captures whatever is in
# front of the underscore.
authodid =  sub("_.*","",author)

# this line extracts the number after the underscore, but I don't know 
# how this is achieved
paperno <- sub(".*_(\\w*)\\s.*", "\\1", author)

# this line extracts the string after the numbers
# I also have no idea how this is achieved through the code
coauthor <- gsub("<","",sub("^.*?\\s","", author))

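I have read online that the first argument is the pattern, the second is the replacement, and the third is the object to operate on. I have also seen several posts on SO and learned that \\w means a word and \\s is a space.

However, some things are still unclear. \\w means a word, but does it mean the next word? If not, how should I interpret it? I understand that ^ matches the beginning of the string, but what about the period after ^?

More importantly, how should _.* and .*_ be interpreted? What about ^.*?\\s? How should I read them?

Thanks!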

Answer 1

OK. There are a lot of questions here. First things first.

sub("_.*","",author) looks for _ and everything that follows it. In your case, _.* matches _1 A Kumar; Ahmed Hemani ; Johnny Öberg<. The function sub replaces that match with "" (in effect deleting it), so you end up with 10.
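As a quick check of the explanation above, here is a minimal sketch that shows both the text the pattern matches and the result of deleting it. The author string is copied from the question as posted; the variable names matched and authorid are only illustrative.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# regexpr() locates the first match; regmatches() returns the matched text,
# i.e. the underscore and everything after it
matched <- regmatches(author, regexpr("_.*", author))

# sub() replaces that first match with the empty string
authorid <- sub("_.*", "", author)

print(matched)
print(authorid)   # "10"
```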

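sub(".*_(\\w*)\\s.*", "\\1", author) is trickier (for no good reason). Strictly speaking, it does not extract anything: it replaces the entire string. If you change the code to sub(".*_(\\w*)\\s.*", "222", author), the result is 222 (instead of 1). So whatever you pass as the second argument is what you get back. Why is that? Because ".*_(\\w*)\\s.*" matches the whole string: .*_ matches 10_; (\\w*) matches 1; and \\s.* matches a space and everything after it (that is, the rest of the string). In the original call the replacement is "\\1", which stands for whatever the parenthesized group (\\w*) matched, and that is why the result is 1.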
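The sketch below illustrates the two points made above: the pattern covers the whole string, and "\\1" refers to the text captured by the parenthesized group. It also touches on the \\w question: \\w matches a single word character (a letter, digit, or underscore), so \\w* is a run of zero or more of them, not "the next word". The variable names are illustrative, and the author string is the one from the question.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"
pattern <- ".*_(\\w*)\\s.*"

# The text matched by the full pattern is the entire input string
whole <- regmatches(author, regexpr(pattern, author))
print(whole == author)

# regexec() also reports the capture groups: element 1 is the full match,
# element 2 is what (\\w*) captured
groups <- regmatches(author, regexec(pattern, author))[[1]]
print(groups[2])

# Replacing the whole match with the captured group leaves only that group
paperno <- sub(pattern, "\\1", author)    # "1"

# Replacing the whole match with a fixed string leaves only that string
fixed <- sub(pattern, "222", author)      # "222"
```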

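gsub("<","",sub("^.*?\\s","", author)) consists of two function calls. The first is sub("^.*?\\s","", author). It looks for everything from the start of the string up to and including the first space: the ? after .* makes the match stop as early as possible. So ^.*?\\s matches 10_1 followed by the space, and sub deletes it. You end up with A Kumar; Ahmed Hemani ; Johnny Öberg<. The second call, gsub, then removes every "<" wherever it occurs.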
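To make the role of the ? concrete, this sketch compares the lazy pattern from the question with a greedy variant that is not in the original code and is added here only for contrast. Recall that . matches any single character and * repeats it. The last two lines use a made-up string to show how gsub differs from sub.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Lazy: ".*?" stops at the first whitespace, so only "10_1 " is removed
lazy <- sub("^.*?\\s", "", author)

# Greedy (for comparison only): ".*" runs to the last whitespace,
# so everything up to and including "Johnny " is removed
greedy <- sub("^.*\\s", "", author)

# gsub() removes every "<" from the result of the lazy sub()
coauthor <- gsub("<", "", lazy)

# sub() replaces only the first match, gsub() replaces all of them
first_only <- sub("<", "", "a<b<c")    # "ab<c"
all_of_them <- gsub("<", "", "a<b<c")  # "abc"
```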

Hope that helps.

Solution 1
1 accepted 2017-04-02 05:42:15
