Interpreting the regular expressions in R's sub and gsub

Question

I am having trouble understanding what the regular expressions in the following lines of code mean.

author = "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# after some experiment, it looks like this line captures whatever is in
# front of the underscore.
authodid =  sub("_.*","",author)

# this line extracts the number after the underscore, but I don't know 
# how this is achieved
paperno <- sub(".*_(\\w*)\\s.*", "\\1", author)

# this line extracts the string after the numbers
# I also have no idea how this is achieved through the code
coauthor <- gsub("<","",sub("^.*?\\s","", author))

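I have read online that the first argument is the pattern, the second is the replacement, and the third is the object to operate on. I have also seen several posts on SO and learned that \\w means a word and \\s is a space.

However, some things are still unclear. If \\w means a word, does it mean the next word? If not, how should I interpret it? I understand that ^ matches the start of the string, but what about the period after the ^?

More importantly, how should _.* and .*_ be interpreted? What about ^.*?\\s? How should I read them?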

Thanks!

Answer 1

OK. There are a lot of questions here. First things first.

sub("_.*","",author) looks for _ and everything that follows it. In your case, _.* matches _1 A Kumar; Ahmed Hemani ; Johnny Öberg<. The sub function replaces that match with '' (in effect deleting it), so you end up with 10.
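To see exactly which part of the string _.* matches, here is a minimal sketch using base R's regexpr() and regmatches(). These two calls and the variable names matched and authorid are only illustrative and are not part of the original code.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Show the substring that "_.*" matches: the underscore and everything after it.
matched <- regmatches(author, regexpr("_.*", author))
print(matched)

# sub() replaces that first match with the empty string, leaving only the prefix.
authorid <- sub("_.*", "", author)
print(authorid)  # "10"
```

Printing matched before calling sub() makes it easier to check that the pattern covers what you expect before anything is deleted.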

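sub(".*_(\\w*)\\s.*", "\\1", author) is trickier (for no good reason). It does not really extract anything; it replaces the whole string. If you change the code to sub(".*_(\\w*)\\s.*", "222", author), the result is 222 (instead of 1). So whatever you pass as the second argument is what you get back. Why? Because ".*_(\\w*)\\s.*" matches the entire string: .*_ matches 10_; (\\w*) matches 1; and \\s.* matches a space and everything after it (that is, the rest of the string). In the original call the second argument is "\\1", a backreference to whatever the parenthesized group (\\w*) matched, which is why the result there is 1.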
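The behaviour described above can be checked with a short sketch. The variable names pattern, whole_match, literal and word_run, and the test string "ab_1 cd", are made up for illustration.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"
pattern <- ".*_(\\w*)\\s.*"

# The pattern covers the whole string, so the whole string gets replaced.
whole_match <- regmatches(author, regexpr(pattern, author))
print(whole_match == author)  # TRUE

# "\\1" is a backreference to the text captured by the parentheses, here "1".
paperno <- sub(pattern, "\\1", author)
print(paperno)  # "1"

# A literal replacement ignores the capture group entirely.
literal <- sub(pattern, "222", author)
print(literal)  # "222"

# \\w matches a single word character (letter, digit or underscore);
# \\w* matches a run of zero or more of them, not "the next word".
word_run <- regmatches("ab_1 cd", regexpr("\\w*", "ab_1 cd"))
print(word_run)  # "ab_1"
```

The last call also speaks to the question about \\w: it stands for one word character, and it is the * quantifier that extends the match to a run of them.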

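gsub("<","",sub("^.*?\\s","", author)) consists of two function calls. The first is sub("^.*?\\s","", author). Starting from the beginning of the string (^), it matches as few characters as possible (.*?) up to and including the first whitespace character (\\s). So ^.*?\\s matches 10_1 plus the space after it, and sub removes that. You end up with A Kumar; Ahmed Hemani ; Johnny Öberg<. The second call, gsub, then removes every "<" in the result.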
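A small sketch contrasting the lazy quantifier .*? with the greedy .* may make the difference clearer. The greedy variant and the variable names lazy and greedy are added here for comparison only and do not appear in the original code.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Lazy: "^.*?\\s" stops at the first whitespace, so only "10_1 " is removed.
lazy <- sub("^.*?\\s", "", author)
print(lazy)  # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Greedy: "^.*\\s" runs to the last whitespace, removing far more.
greedy <- sub("^.*\\s", "", author)
print(greedy)  # "&Ouml;berg<"

# gsub() then removes every "<", not just the first one.
coauthor <- gsub("<", "", lazy)
print(coauthor)  # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg"
```

sub() replaces only the first match, while gsub() replaces all matches, which is why gsub is the natural choice for stripping a stray character wherever it occurs.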

Hope this helps.

Solution 1
1 accepted 2017-04-02 05:42:15
