How to efficiently compare two different columns (both containing strings) in R?

Question

Suppose A is a data frame with the following structure:

Row no    C1                   C2
1         I am fine            1234
2         He is fine           1234
3         am better            1234
4         better butter        1234
5         fine good            1234
6         good to be better    1234

B is another data frame:

Row no    C1
1         fine
2         good

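I want to compare A$C1 with B$C1: the strings in B$C1 should be contained in A$C1. The result of the comparison should be the row numbers of A whose C1 contains a string from B. For the example above, the output is 1, 2, 5, 6, because rows 1, 2 and 5 contain the word "fine" and row 6 contains the word "good". I don't want to check "good" against row 5 of A, because row 5 has already been selected. I need an efficient solution, since my real data set A has about 400,000 rows and B has about 10,000.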

Answer 1

grep can do this job for you:

grep(paste(B$C1, collapse="|"), A$C1)
[1] 1 2 5 6

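The code above returns the rows of A$C1 that contain at least one word from B$C1, i.e. rows 1, 2, 5 and 6. The first argument of grep is a regular expression, which is why the words are collapsed with "|" (meaning "or").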
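As a self-contained sketch of the same idea, the snippet below rebuilds the two data frames from the question and runs the grep call on them. The variable names pattern, hits and strict are only illustrative. It also shows one thing to watch for: an alternation like "fine|good" matches substrings, so "refined" would count as a hit; wrapping the pattern in \b word boundaries is one possible way to restrict it to whole words, assuming that is what you want.

```r
# Rebuild the example data from the question
A <- data.frame(C1 = c("I am fine", "He is fine", "am better",
                       "better butter", "fine good", "good to be better"),
                C2 = 1234, stringsAsFactors = FALSE)
B <- data.frame(C1 = c("fine", "good"), stringsAsFactors = FALSE)

# Collapse the words into a single regular expression: "fine|good"
pattern <- paste(B$C1, collapse = "|")
hits <- grep(pattern, A$C1)      # row numbers of A with at least one match

# Substring vs whole-word matching on two test strings
test <- c("refined sugar", "I am fine")
loose_hits <- grep(pattern, test)                     # both strings match
strict <- paste0("\\b(", pattern, ")\\b")             # whole words only
strict_hits <- grep(strict, test)                     # only the second matches
```

A[hits, ] then gives the matching rows themselves. If the words in B can contain regex metacharacters such as "." or "(", they would need escaping before being pasted into a pattern.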

It also seems to scale. In a benchmark with 100,000 sample phrases (drawn from yours) and two words, grep took only 0.076 seconds.
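A timing like this can be reproduced roughly as follows. This is only a sketch: how the 100,000 phrases were generated is not stated in the answer, so sampling with replacement from the six example phrases is an assumption, and the elapsed time will differ from machine to machine.

```r
# Assumption: the benchmark phrases are resampled from the six examples
phrases <- c("I am fine", "He is fine", "am better", "better butter",
             "fine good", "good to be better")
words <- c("fine", "good")

set.seed(1)                                   # make the sample repeatable
big <- sample(phrases, 1e5, replace = TRUE)   # 100,000 phrases

# Time a single grep call over all phrases
timing <- system.time(
    hits <- grep(paste(words, collapse = "|"), big)
)
timing["elapsed"]
```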

Answer 2

The function

phrasesWithWords <- function(x, table)
{
    words <- strsplit(x, "\\W")
    found <- relist(unlist(words) %in% table, words)
    which(sapply(found, any))
}

works on your phrases and a table of acceptable words:

phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)

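The function works by splitting the phrases into words, looking up all the words in the table in one vectorized step (rather than looping over the long list of phrases), re-listing the resulting logical vector so it has the shape of the phrases again, and asking which list elements contain at least one TRUE.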
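The steps described above can be traced on the sample phrases one at a time. The intermediate names flat_hit and hits are introduced here only for illustration; the logic is the same as in phrasesWithWords.

```r
phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
table <- c("fine", "good")

# Step 1: split every phrase on non-word characters
words <- strsplit(phrase, "\\W")
words[[1]]                       # "I" "am" "fine"

# Step 2: one vectorized lookup over all words of all phrases
flat_hit <- unlist(words) %in% table

# Step 3: give the flat logical vector the same shape as 'words'
found <- relist(flat_hit, words)
found[[5]]                       # TRUE TRUE ("fine" and "good")

# Step 4: keep the phrases with at least one TRUE
hits <- which(sapply(found, any))
```

Note that %in% compares whole words, so unlike the grep pattern this approach would not treat "refined" as containing "fine".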

It turns out to be inefficient compared with the simple grep solution

f1 <- function(x, table)
    grep(paste(table, collapse="|"), x)

with

library(microbenchmark)
x1000 <- rep(phrase, 1000)

giving

> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+                times=5)
Unit: milliseconds
                           expr        min         lq     median         uq
 phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888
               f1(x1000, table)   2.959576   2.973416   2.990412   3.060494
        max neval
 134.504282     5
   3.439293     5

The neat package "lineprof" shows that, for the modified function

f0 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    idx <- unlist(words) %in% table
    found <- relist(idx, words)
    which(sapply(found, any))
}

the main bottleneck is in relist

> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
   time  alloc release  dups                ref         src
1 0.003  0.668       0     0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240       0 17393 words.R!30719TCY#5 f0/relist  
3 0.003  3.959       0  6617 words.R!30719TCY#6 f0/which

leading to a more refined approach

f2 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    len <- cumsum(sapply(words, length))
    idx <- cumsum(unlist(words) %in% table)
    which(idx[len] != c(0, idx[head(len, -1)]))
}
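To see why f2 avoids relist, it can help to trace its cumulative sums on the six sample phrases. This is a walk-through of the same computation, with the hypothetical names at_end and before added for the two vectors being compared.

```r
phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
table <- c("fine", "good")

words <- strsplit(phrase, "\\W")

# Position of each phrase's last word in the flattened word vector
len <- cumsum(sapply(words, length))       # 3 6 8 10 12 16

# Running count of matching words over the flattened vector
idx <- cumsum(unlist(words) %in% table)

# Running count at the end of each phrase, and at the end of the previous one
at_end <- idx[len]                         # 1 2 2 2 4 5
before <- c(0, idx[head(len, -1)])         # 0 1 2 2 2 4

# The count grew within a phrase exactly when that phrase has a match
hits <- which(at_end != before)            # 1 2 5 6
```

The comparison replaces the per-phrase any() with two vector subscripts, so nothing needs to be re-listed.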

which performs better

> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
             expr       min        lq    median        uq       max neval
 f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559     5
 f1(x1000, table)  2.963365  2.968197  2.984395  2.984423  3.129873     5

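I think both f2 and f1 will scale well enough to the problem in the original question, provided there is enough memory (if the table of acceptable words is small compared with the phrases, then I think the grep approach will actually be more memory-efficient; in the end I think I might vote for the simple grep solution!). The main limitation of the grep approach may be that the size of the regular expression is limited, on my computer to about 2560 terms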

> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") : 
  invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4
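Since B in the question has about 10,000 words, a single pattern may run into this limit. One possible workaround, sketched here under the assumption that the limit depends on the number of alternatives, is to split the word table into smaller chunks and combine the results. The function name grep_chunked and the default chunk_size are hypothetical; a suitable chunk size would have to be found by trial on your own machine. Rows that already matched an earlier chunk are skipped, which also fits the question's wish not to re-check selected rows.

```r
# Hypothetical helper: search x for the words in 'table', a chunk at a time
grep_chunked <- function(x, table, chunk_size = 1000L) {
    hit <- logical(length(x))                 # nothing matched yet
    chunks <- split(table, ceiling(seq_along(table) / chunk_size))
    for (chunk in chunks) {
        todo <- which(!hit)                   # only rows not yet selected
        if (length(todo) == 0L) break
        pattern <- paste(chunk, collapse = "|")
        hit[todo] <- grepl(pattern, x[todo])
    }
    which(hit)
}

phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
# chunk_size = 1 forces one pattern per word, to exercise the loop
res <- grep_chunked(phrase, c("fine", "good"), chunk_size = 1L)
```

As with the single pattern, the words are treated as regular expressions, so any metacharacters in them would need escaping first.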

Answer 1
3 2014-03-09 20:49:27

Answer 2
3 accepted 2014-03-09 20:56:30
