
R: Comparing fields in matrix

I've got two data frames I want to compare. If a specific location in both data frames meets a requirement, assign "X" to that location in a separate data frame.

How can I get the expected output in an efficient way? The real data frame contains 1000 columns with thousands to millions of rows. I think data.table would be the quickest option, but I don't have a grasp of how data.table works yet.

Expected output:

> print(result)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [2,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [3,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "X" 

My code:

df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))

result <- matrix("O", nrow(df1), ncol(df1))


for (i in 1:nrow(df1)) 
{
  for (j in 3:ncol(df1)) 
  {
    result[i,1] = c("A")
    result[i,2] = c("A")
    if (is.na(df1[i,j]) || is.na(df2[i,j])){
      result[i,j] <- c("N")
    }
    if (!is.na(df1[i,j]) && !is.na(df2[i,j]) && !is.na(df2[i,j]))
    {

      if (df1[i,j] %in% c("0","1","2") & df2[i,j] %in% c("0","1","2")) {
        result[i,j] <- c("X") 
      }
    }
  }
}   


print(result)

Edit

I like both @David's and @Heroka's solutions. On a small dataset, Heroka's solution is 125 times as fast as the original, and David's is 29 times as fast. Here's the benchmark:

> mbm
Unit: milliseconds
             expr        min          lq       mean      median          uq        max neval
         original 1058.81826 1110.481659 1131.81711 1112.848211 1124.775989 1428.18079   100
           Heroka    8.46317    8.711986    9.03517    8.914616    9.067793   18.06716   100
 DavidAarenburg()   35.58350   36.660565   39.85823   37.061160   38.175700   53.83976   100

Thanks a lot, guys!

You have matrices, not data frames.
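A quick way to confirm this, as a minimal sketch that redefines `df1` from the question so it runs on its own:

```r
# df1 exactly as defined in the question
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))

is.matrix(df1)      # TRUE
is.data.frame(df1)  # FALSE

# Comparisons on a matrix are elementwise and keep the matrix shape,
# which is what makes a loop-free solution possible.
dim(df1 < 3)
```

Because the data are numeric matrices, whole-matrix comparisons can replace the nested loops.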

One approach might be to use ifelse, with %in% applied to a numeric vector rather than a character one; avoiding the type conversion saves about 50% of the time:

  result <- ifelse(is.na(df1)|is.na(df2),"N",
                   ifelse(df1 %in% 0:2 & df2 %in% 0:2,"X","O"))
  result[,1:2] <- "A"
  result
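One detail worth knowing when adapting this: `%in%` returns a plain logical vector and drops the matrix dimensions, so the result above gets its shape from the outer `ifelse` test (`is.na(df1) | is.na(df2)`), which is still a matrix. A small sketch on a made-up matrix `m` (hypothetical data, not from the question):

```r
# Hypothetical 2 x 3 matrix with one missing value
m <- matrix(c(0, 1, 2, 3, NA, 5), nrow = 2)

in_range <- m %in% 0:2
dim(in_range)            # NULL: %in% returns a plain vector

# Restore the matrix shape if the logical result is needed as a matrix
dim(in_range) <- dim(m)
in_range
```

Note that `NA %in% 0:2` is `FALSE` rather than `NA`, so missing values need the separate `is.na()` check.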

With thanks to @DavidArenburg, a further speed improvement:

result <- matrix("O",nrow=nrow(df1),ncol=ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"
result[df1 < 3 & df2 < 3] <- "X"
result[, 1:2] <- "A"
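The sketch below checks how this version treats missing values. The `NA` placed in `df1[2, 5]` is a hypothetical addition, since the sample data contain none. Where either matrix is `NA`, `df1 < 3 & df2 < 3` is `NA` or `FALSE`, and an `NA` in a logical index selects nothing when a single value is assigned, so the "N" set on the previous line survives.

```r
# df1 and df2 as defined in the question
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))

df1[2, 5] <- NA  # hypothetical missing value, not in the original data

result <- matrix("O", nrow = nrow(df1), ncol = ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"  # mark missing cells first
result[df1 < 3 & df2 < 3] <- "X"        # NA index entries are skipped
result[, 1:2] <- "A"

result[2, 5]  # "N"
result[3, 9]  # "X"
```

Note that `< 3` is not strictly equivalent to `%in% 0:2`: it also matches negative and fractional values. That is fine if the data only contain non-negative integers, as the sample does.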
