
Multi-threaded data.table in R much slower than single-threaded

Here is a simple example:

#Load libraries:
library(data.table)
library(tictoc)

#Set the number of threads:
setDTthreads(1)

#Set seed:
set.seed(1)

#Create a small example:
n_f<-2500
n_w<-sample(1:1000,n_f,replace=TRUE)
n_t<-sample(c("","AS","E","D","F"),n_f,replace=TRUE)
yearmonth<-ifelse(nchar(seq(1,12))==1,paste("0",seq(1,12),sep=""),seq(1,12))
yearmonth<-paste(rep(seq(1996,2018),each=12),yearmonth,sep="")

#Create a large synthetic data set:
data_final<-vector("list",length(yearmonth))
for (i in 1:length(yearmonth)){
  data_aux<-data.table(fid=rep(1:n_f,n_w),type=rep(n_t,n_w),date=yearmonth[i])
  data_final[[i]]<-data_aux
}

#Combine everything together:
data_final<-rbindlist(data_final)

#Do the calculation:
tic()
data_final[,nr_unique_type:=uniqueN(type),by=c("fid","date")]
toc()

On my machine the calculation takes about 23 seconds. On the other hand, if I do not call setDTthreads(1) (so data.table uses all 32 cores), it runs for 53 seconds. Could somebody explain why it is so much slower with multithreading?

I am using R 3.6.0 and data.table 1.12.8.
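
For reference, a minimal sketch of how the two settings could be timed against each other on the same data (assuming the synthetic data_final built above; the copy() calls are only there so both runs start from an identical table, and system.time() is used instead of tictoc):

# Sketch: time the same grouped computation with one thread vs. all threads.
library(data.table)

dt1 <- copy(data_final)
setDTthreads(1)                      # force single-threaded execution
t_single <- system.time(
  dt1[, nr_unique_type := uniqueN(type), by = c("fid", "date")]
)

dt2 <- copy(data_final)
setDTthreads(0)                      # 0 = use all logical CPUs
t_multi <- system.time(
  dt2[, nr_unique_type := uniqueN(type), by = c("fid", "date")]
)

print(t_single)
print(t_multi)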

As per the discussion in the comments: this was a known inefficiency caused by forming a team of CPU threads even when computing relatively small tasks. It has been improved in a recent version of data.table (1.12.10), which the OP has verified. It is no longer necessary to use setDTthreads(1) to avoid this inefficiency.
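
As a sketch (not part of the original answer), one way to check which data.table version is installed and how many threads are actually in use, and to go back to multithreading after any setDTthreads(1) experiments:

# Sketch: inspect the data.table version and current thread settings.
library(data.table)

packageVersion("data.table")   # per the answer above, >= 1.12.10 includes the improvement
getDTthreads(verbose = TRUE)   # prints OpenMP details and the thread count in use

# Re-enable multithreading after testing; 0 means use all logical CPUs:
setDTthreads(0)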
