
Speed up quantile calculation

I am using the Hmisc package to calculate the quantiles of two continuous variables and compare the results in a cross table. My code is below.

My problem is that calculating the quantiles takes a considerable amount of time as the number of observations increases.

Is there any way to speed up this procedure by using data.table, ddply, or any other package?

Thanks.

library(Hmisc)

# Set seed
set.seed(123)

# Generate some data
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
c <- data.frame(a,b)

# Calculate quantiles
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)

# Output some descriptives
summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)

# Time spent for calculation:
#       user      system     elapsed
#      25.13        3.47       28.73 

As stated by jlhoward and Ricardo Saporta, data.table doesn't seem to speed things up much in this case. The cut2 function is clearly the bottleneck here. I used another function to calculate the quantiles (see Is there a better way to create quantile "dummies" / factors in R?) and was able to cut the calculation time by more than half:

qcut <- function(x, n) {
  if (n <= 2) {
    stop("The sample must be split in at least 3 parts.")
  }
  # n + 1 quantile break points define n bins
  break.values <- quantile(x, seq(0, 1, length.out = n + 1), type = 7)
  # First bin is closed on both sides; the remaining bins are left-open
  break.labels <- paste0(
    c(">=", rep(">", n - 1)), break.values[1:n],
    " & <=", break.values[2:(n + 1)]
  )
  cut(x, break.values, labels = break.labels, include.lowest = TRUE)
}

c$a.quantile.2 <- qcut(c$a, 5)
c$b.quantile.2 <- qcut(c$b, 5)
summaryM(a.quantile.2 ~ b.quantile.2, data=c, overall=TRUE)

# Time spent for calculation:
#       user      system     elapsed
#      10.22        1.47       11.70 

Using data.table would reduce the calculation time by another second, but I like the summary produced by the Hmisc package better.
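Since the binning step is the bottleneck, one further option is to skip factor creation and work with plain integer bin codes. The sketch below is not from the answers above and has not been timed against them; qbin is a hypothetical helper name, and it assumes labelled factors are not needed. It relies on base R's findInterval() with the same quantile breaks that qcut uses.

```r
# Hypothetical helper (not from the answers): integer quantile-bin codes
qbin <- function(x, n) {
  # n + 1 break points for n bins, same quantile type as qcut()
  breaks <- quantile(x, seq(0, 1, length.out = n + 1), type = 7, names = FALSE)
  # left.open = TRUE gives (lo, hi] intervals; combined with it,
  # rightmost.closed = TRUE closes the first interval on the left,
  # so min(x) falls into bin 1 (like include.lowest = TRUE in cut())
  findInterval(x, breaks, left.open = TRUE, rightmost.closed = TRUE)
}

# Small deterministic example: each of 1..25 appears four times
x <- rep(1:25, each = 4)
y <- rev(x)

bins.x <- qbin(x, 5)
bins.y <- qbin(y, 5)

# Cross-tabulate the integer codes
table(bins.x, bins.y)
```

The codes are meant to line up with as.integer(cut(x, breaks, include.lowest = TRUE)), so the labels from qcut could be attached afterwards if they are needed for reporting.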

You can use data.table's built-in .N variable to tabulate quickly.

library(data.table)
library(Hmisc)

DT <- data.table(a,b)
DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]

DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]

You can break that last line down into two steps to see what is going on. The second "[" simply reshapes the data into a clean format.

DT.tabulated <- DT[, .N, keyby=list(b.quantile, a.quantile)]
DT.tabulated

DT.tabulated[, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
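To make the two steps easier to follow, here is a small sketch on toy data. The labels "low" and "high" are made up for illustration, and the reshaping uses data.table's dcast() rather than the setNames() idiom above, so treat it as an alternative form of the same count-then-reshape idea, not as the answer's own method.

```r
library(data.table)

# Toy data: two already-binned variables (labels are made up for illustration)
DT.toy <- data.table(
  a.quantile = c("low", "low", "high", "high", "high", "low"),
  b.quantile = c("low", "high", "low", "high", "high", "low")
)

# Step 1: count rows per (b, a) combination; .N is the group size
counts <- DT.toy[, .N, keyby = list(b.quantile, a.quantile)]

# Step 2: spread the counts into one column per b.quantile level
wide <- dcast(counts, a.quantile ~ b.quantile, value.var = "N", fill = 0L)
wide
```

Each row of wide is one level of a.quantile, and each remaining column holds the counts for one level of b.quantile.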

Data tables don't seem to improve things here:

library(Hmisc)
set.seed(123)
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)

library(data.table)
# original approach
system.time({
  c <- data.frame(a,b)
  c$a.quantile <- cut2(a, g=5)
  c$b.quantile <- cut2(b, g=5)
  smry.1 <-summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
})
   user  system elapsed 
  72.79    6.22   79.02 

# original data.table approach
system.time({
  DT <- data.table(a,b)
  DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
  smry.2 <- DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
})
   user  system elapsed 
  66.86    5.11   71.98 

# different data.table approach (simpler, and uses table(...))
system.time({
  dt     <- data.table(a,b)
  smry.3 <- table(dt[,lapply(dt,cut2,g=5)])
})
   user  system elapsed 
  67.24    5.02   72.26 
