
Multi-threaded data.table in R much slower than single-threaded

Here is a simple example:

#Load libraries:
library(data.table)
library(tictoc)

#Set the number of threads:
setDTthreads(1)

#Set seed:
set.seed(1)

#Create a small example:
n_f<-2500
n_w<-sample(1:1000,n_f,replace=TRUE)
n_t<-sample(c("","AS","E","D","F"),n_f,replace=TRUE)
yearmonth<-ifelse(nchar(seq(1,12))==1,paste("0",seq(1,12),sep=""),seq(1,12))
yearmonth<-paste(rep(seq(1996,2018),each=12),yearmonth,sep="")

#Create a large synthetic data set:
data_final<-vector("list",length(yearmonth))
for (i in 1:length(yearmonth)){
  data_aux<-data.table(fid=rep(1:n_f,n_w),type=rep(n_t,n_w),date=yearmonth[i])
  data_final[[i]]<-data_aux
}

#Combine everything together:
data_final<-rbindlist(data_final)

#Do the calculation:
tic()
data_final[,nr_unique_type:=uniqueN(type),by=c("fid","date")]
toc()

On my machine the calculation takes about 23 seconds. On the other hand, if I do not call setDTthreads(1) (so data.table uses all 32 cores), it runs for 53 seconds. Could somebody explain why it is so much slower with multithreading?

I am using R 3.6.0 and data.table 1.12.8.
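
For reference, a minimal sketch of how the two settings could be timed against each other on the same data (assuming the synthetic data_final built above; the copy() calls are only there so both runs start from an identical table, and system.time() is used instead of tictoc):

# Sketch: time the same grouped computation with one thread vs. all threads.
library(data.table)

dt1 <- copy(data_final)
setDTthreads(1)                      # force single-threaded execution
t_single <- system.time(
  dt1[, nr_unique_type := uniqueN(type), by = c("fid", "date")]
)

dt2 <- copy(data_final)
setDTthreads(0)                      # 0 = use all logical CPUs
t_multi <- system.time(
  dt2[, nr_unique_type := uniqueN(type), by = c("fid", "date")]
)

print(t_single)
print(t_multi)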

As per the discussion in the comments: this was a known inefficiency caused by forming a team of CPU threads even when computing relatively small tasks. It has been improved in a recent version of data.table (1.12.10), which the OP has verified. It is no longer necessary to use setDTthreads(1) to avoid this inefficiency.
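
As a sketch (not part of the original answer), one way to check which data.table version is installed and how many threads are actually in use, and to go back to multithreading after any setDTthreads(1) experiments:

# Sketch: inspect the data.table version and current thread settings.
library(data.table)

packageVersion("data.table")   # per the answer above, >= 1.12.10 includes the improvement
getDTthreads(verbose = TRUE)   # prints OpenMP details and the thread count in use

# Re-enable multithreading after testing; 0 means use all logical CPUs:
setDTthreads(0)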
