
Find the count of matching elements in R

Sorry for the noob question! I am trying to count the number of elements in basket x that also appear in basket y. I have the following data:

user_id basket.x basket.y
1         1,2,3    2,3,4
2         5,6,7    1,2,7

I have tried the following loop, but it does not work:

df["total"] <- 0
df["TP"] <- 0
for(i in 1:nrow(df)){
 for(j in 1:nrow(df)){
  if(all(df$basket.x[i] %in% df$basket.y[j])){
     df$total <- total + 1
     df$TP <- TP + 1
  }
 }
}

It returns this:

user_id basket.x basket.y   total TP
1         1,2,3    2,3,4     0    0
2         5,6,7    1,2,7     0    0

However, the desired result is:

user_id basket.x basket.y   total TP
1         1,2,3    2,3,4     3    2
2         5,6,7    1,2,7     3    1

Could anyone please point out where I made a mistake? Thank you.

Output of dput():

structure(list(user_id = c(2957L, 7306L, 10219L, 11290L, 13222L, 
13554L), basket.x = c("13870,22963,1158,18362"),basket.y = 
c("24852,432,47626,33647,6015,1158,24852,24852,24852")
), row.names = c(NA, 
6L), class = "data.frame")

As noted by @JohnColeman, there is something wrong with your dput output, so I am using a combination of it and your original example.

df = structure(list(user_id = c(2957L, 7306L, 10219L), 
basket.x = c("13870,22963,1158,18362", "1,2,3", "5,6,7"),
basket.y = c("24852,432,47626,33647,6015,1158,24852,24852,24852",
"2,3,4", "1,2,7")
), row.names = c(1L,2L,3L), class = "data.frame")
df
  user_id               basket.x
1    2957 13870,22963,1158,18362
2    7306                  1,2,3
3   10219                  5,6,7
                                           basket.y
1 24852,432,47626,33647,6015,1158,24852,24852,24852
2                                             2,3,4
3                                             1,2,7

Using this data, we can get the individual elements of each basket using strsplit. Once we have the elements, we can use intersect to find those that are in both basket.x and basket.y. To get how many elements the two baskets share, we just take the length of the intersection. Of course, we need to apply this across all of the rows of df. Putting this together, we get:

sapply(1:nrow(df), function(i) 
    length(intersect(strsplit(df$basket.x, ",")[[i]],
            strsplit(df$basket.y, ",")[[i]])))
[1] 1 2 1
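To make the individual steps visible, here is a minimal sketch for a single row, using the "1,2,3" and "2,3,4" baskets from the question. The variable names x, y, x_items, y_items and shared are only illustrative and do not come from the original post.

```r
# One row of the example data, taken from the question
x <- "1,2,3"
y <- "2,3,4"

# strsplit() returns a list with one character vector per input string,
# so [[1]] extracts the vector for our single string
x_items <- strsplit(x, ",")[[1]]   # "1" "2" "3"
y_items <- strsplit(y, ",")[[1]]   # "2" "3" "4"

# intersect() keeps the distinct values that appear in both vectors
shared <- intersect(x_items, y_items)
shared          # "2" "3"
length(shared)  # 2
```

Note that the items are compared as strings, and that intersect drops duplicates, so an item repeated within a basket is counted only once.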

Edit: Thanks to @thelatemail for noticing that the way I wrote this is very inefficient, since it re-splits both whole columns on every iteration. Better is:

sapply(1:nrow(df), function(i) 
    length(intersect(unlist(strsplit(df$basket.x[[i]], ",")),
            unlist(strsplit(df$basket.y[[i]], ",")))))

A variation on @G5W's answer is possible using Map to replace (well, hide) the loop over each row index:

spl <- unname(lapply(df[-1], strsplit, ","))
lengths(do.call(Map, c(intersect, spl)))
#[1] 1 2 1

Although you have to save the intermediate spl, this should be significantly faster if you're dealing with larger datasets.
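To get from these counts to the table the question asks for, the results can be written back into the data frame. The following is only a sketch under one assumption: the question never defines total, so it is taken here to be the number of items in basket.x, which matches the value 3 shown in the desired output. The names x_items and y_items are illustrative.

```r
# Example data from the question
df <- data.frame(
  user_id  = c(1L, 2L),
  basket.x = c("1,2,3", "5,6,7"),
  basket.y = c("2,3,4", "1,2,7"),
  stringsAsFactors = FALSE
)

# Split each comma-separated basket into a character vector of items
x_items <- strsplit(df$basket.x, ",")
y_items <- strsplit(df$basket.y, ",")

# total: number of items in basket.x (assumed meaning, see above)
df$total <- lengths(x_items)

# TP: number of distinct items present in both baskets, row by row
df$TP <- lengths(Map(intersect, x_items, y_items))

df
```

For the two example rows this gives total = 3, 3 and TP = 2, 1, which agrees with the desired result in the question.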
