How do I apply a function using comparisons of data in two or more columns with R data.table

I would like to apply a function to an R data.table object that compares values in two columns and returns a result. Here's the example, for data table X:

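library(data.table)

# All three columns need the same length (8) to produce the table shown below
X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
   FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
   SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))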

   POSITION FIRST SECOND
1:        1     A      B
2:        4    BB     AA
3:        5    AA      B
4:        9     B    AAA
5:       24   AAA    BBB
6:       36     B AB,ABB
7:       42   A,B    B,A
8:       56     A      B

I would like to perform the following logical comparisons of the data in columns "FIRST" and "SECOND", to create a "RESULT" column:

 SAME = length of FIRST and SECOND are both one character
 BLOCK = Character length of FIRST and SECOND are the same,
         but greater than one, and not mixed (i.e. no comma)
 LESS = SECOND has fewer characters, but neither is mixed
 MORE = SECOND has more characters, but neither is mixed
 MIXED = either FIRST or SECOND contains a comma

Thus, the desired result would look like:

POSITION FIRST SECOND RESULTS
1        A     B      SAME
4        BB    AA     BLOCK
5        A     B,A    MIXED    
9        AA    B      LESS
24       B     AAA    MORE
28       BBB   A,B    MIXED
36       AAA   BBB    BLOCK
42       B     AB,ABB MIXED
56       A,B   B,A    MIXED

The following works, but it is slow on a file with 4 million rows:

X[, RESULT := ifelse(nchar(FIRST)+nchar(SECOND)==2,"SAME",
    ifelse(grepl(",", FIRST) | grepl(",",SECOND), "MIXED",
       ifelse(nchar(FIRST) > nchar(SECOND), "LESS",
          ifelse(nchar(FIRST) < nchar(SECOND), "MORE","BLOCK")))]

But it does give the desired result:

   POSITION FIRST SECOND RESULT
1:        1     A      B   SAME
2:        4    BB     AA  BLOCK
3:        5    AA      B   LESS
4:        9     B    AAA   MORE
5:       24   AAA    BBB  BLOCK
6:       36     B AB,ABB  MIXED
7:       42   A,B    B,A  MIXED
8:       56     A      B   SAME

I actually have several more conditions to test, and some of them are more complicated than just character counts. Rather than a long ifelse statement, is it possible to apply a function, taking the two columns as input? For example:

checkType <- function(x) {
  if(nchar(x$FIRST)+nchar(x$SECOND)==2) {
    type <- "SAME"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) > nchar(x$SECOND))) {
    type <- "LESS"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) < nchar(x$SECOND))) {
    type <- "MORE"
  } else if (!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) == nchar(x$SECOND)) & nchar(x$SECOND)>1) {
    type <-"BLOCK"
  } else {
    type <- "MIXED"
  }
  return(type)
}

> checkType(X[1,])
[1] "SAME"

for(i in 1:nrow(X)) X[i, RESULT := checkType(X[i,])]

So while the above works, it's obviously not the optimal way to run things with data.table. However, I tried lapply and apply, and neither works:

X[, RESULT3 := lapply(.SD, checkType)]
 Error in x$FIRST : $ operator is invalid for atomic vectors 
  nchar(x$FIRST) 
  FUN(X[[1L]], ...) 
  lapply(.SD, checkType) 
  eval(expr, envir, enclos) 
  eval(jsub, SDenv, parent.frame()) 
  `[.data.table`(X, , `:=`(RESULT3, lapply(.SD, checkType))) 
  X[, `:=`(RESULT3, lapply(.SD, checkType))] 

Same result with apply(.SD, 1, checkType). Is what I am trying to do possible by applying a function?
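On why the attempts above fail: lapply(.SD, checkType) calls checkType once per column, passing each column as a plain atomic vector, and apply(.SD, 1, checkType) passes each row as a plain character vector, so x$FIRST is invalid in both cases. A minimal sketch of one alternative is to write a function that takes the two columns as separate vector arguments. The name classify_pair is hypothetical, and the sketch assumes that MIXED takes precedence over every other category:

```r
# Hypothetical helper (not from the original post): classifies each pair of
# strings with vectorised operations, so it accepts whole columns as input.
classify_pair <- function(first, second) {
  nf <- nchar(first)
  ns <- nchar(second)
  mixed <- grepl(",", first, fixed = TRUE) | grepl(",", second, fixed = TRUE)

  result <- rep("BLOCK", length(first))  # default: equal lengths > 1, no comma
  result[nf > ns] <- "LESS"
  result[nf < ns] <- "MORE"
  result[nf == 1 & ns == 1] <- "SAME"
  result[mixed] <- "MIXED"               # assumption: MIXED overrides the rest
  result
}

first  <- c("A", "BB", "AA", "B", "AAA", "B", "A,B", "A")
second <- c("B", "AA", "B", "AAA", "BBB", "AB,ABB", "B,A", "B")
res <- classify_pair(first, second)
print(res)

# With data.table loaded, the same helper could be called inside j:
#   X[, RESULT := classify_pair(FIRST, SECOND)]
```

Because the helper works on whole vectors, it runs once over the table rather than once per row.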

Note that the data table produced by your code (first line below, pasted from your snippet above) is not the same as the data table shown in the "desired results" box below it.

Nevertheless, this might actually be faster, and would definitely be easier to understand. It produces a result which I think is consistent with your rules.

X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                        FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
                        SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))

X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
X$nf    <- nchar(X$FIRST)
X$ns    <- nchar(X$SECOND)
X$RESULT = ""

setkey(X,nf,ns)
X[J(1,1),RESULT:="SAME"]
X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
X[!mixed & nf > ns]$RESULT <- "LESS"
X[!mixed & nf < ns]$RESULT <- "MORE"
X[(mixed)]$RESULT <- "MIXED"
setkey(X,POSITION)

Your categories are not mutually exclusive, so I assume these rules apply in order (for example, what about FIRST="," and SECOND=","?).

Also, I think your definitions of MORE and LESS are the same.
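To illustrate why the order of the rules matters for the edge case mentioned above, here is a minimal base-R sketch. The variable names and the "OTHER" placeholder are hypothetical; only the two overlapping rules (SAME and MIXED) are tested:

```r
# Edge case from the text: both values are a single comma.
first  <- ","
second <- ","

same  <- nchar(first) == 1 & nchar(second) == 1
mixed <- grepl(",", first, fixed = TRUE) | grepl(",", second, fixed = TRUE)

# Same two rules, tested in opposite order.
same_first  <- ifelse(same, "SAME", ifelse(mixed, "MIXED", "OTHER"))
mixed_first <- ifelse(mixed, "MIXED", ifelse(same, "SAME", "OTHER"))

print(c(same_first = same_first, mixed_first = mixed_first))
```

Both conditions are true for this input, so whichever rule is tested first (in a nested ifelse) or assigned last (in a sequence of assignments) decides the label.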

Both answers, from @Frank and @jlhoward, give the desired result and were much quicker than my initial attempt. Building on them, however, this approach (createResult1) was about 4 times faster on a table built from 1,000,000 repetitions of the example rows:

createResult1 <- function(X) {
  X[,`:=`(
    cf=nchar(FIRST),
    cs=nchar(SECOND),
    mf=grepl(',',FIRST),
    ms=grepl(',',SECOND)
    )]
  X[cf==1&cs==1, RESULT:="SAME"]
  X[cf > cs, RESULT:="LESS"]
  X[cf < cs, RESULT:="MORE"]
  X[cf==cs & cs>1, RESULT:="BLOCK"]
  X[(mf)|(ms), RESULT:="MIXED"]
  X[,c('cf','cs','mf','ms'):=NULL]
  return(X)
}

createResult2 <- function(X) { #@Frank
  X[,`:=`(
    cf=nchar(FIRST),
    cs=nchar(SECOND),
    mf=grepl(',',FIRST),
    ms=grepl(',',SECOND)
  )][,RESULT:=ifelse(cf==1&cs==1,"SAME",
                     ifelse(mf | ms, "MIXED",
                            ifelse(cf > cs, "LESS",
                                   ifelse(cf < cs, "MORE","BLOCK"))))
     ][
       ,c('cf','cs','mf','ms'):=NULL
        ]
  return(X)
}

createResult3 <- function(X) { #@jlhoward
  X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
  X$nf    <- nchar(X$FIRST)
  X$ns    <- nchar(X$SECOND)
  X$RESULT = ""

  setkey(X,nf,ns)
  X[J(1,1),RESULT:="SAME"]
  X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
  X[!mixed & nf > ns]$RESULT <- "LESS"
  X[!mixed & nf < ns]$RESULT <- "MORE"
  X[(mixed)]$RESULT <- "MIXED"
  X[,c('nf','ns','mixed'):=NULL]
  setkey(X,POSITION)
  return(X)
}

Create the same data table as above, but with the example rows repeated 1,000,000 times:

X <- as.data.table(list(POSITION=rep(c(1,4,5,9,24,36,42,56),1000000),
                        FIRST=rep(c("A","BB","AA","B","AAA","B","A,B","A"),1000000),
                        SECOND=rep(c("B","AA","B","AAA","BBB","AB,ABB","B,A","B"),1000000)))
Y <- copy(X)
Z <- copy(X)

Here are the results:

> system.time(X <- createResult1(X))
   user  system elapsed 
   4.06    0.05    4.12
> system.time(Y <- createResult2(Y))
   user  system elapsed 
  18.53    0.36   18.94 
> system.time(Z <- createResult2(Z))
   user  system elapsed 
  18.63    0.29   18.97 
> identical(X,Y)
[1] TRUE
> identical(X,Z)
[1] TRUE
