
Efficient data.table rowwise and insertion of new columns

The dataset is very large, so the computation needs to run in parallel. The following is a synthetic dataset:

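require(data.table)
require(furrr)

# Column names for the four coefficient statistics
Names <- c("Estimate", "Std.Error", "t-value", "Pr(>|t|)")
# Fit Y on the remaining columns and return the coefficient row for X
lm_summary <- function(Data) { coef(summary(lm(Y ~ ., data = Data)))["X", ] }
# Build the long table, then nest it into one data.table per id
Synthetic_Data <- data.table(id = rep(seq(1, 10000), each = 1000), X = rnorm(1e6), Y = rnorm(1e6), key = "id")
Synthetic_Data <- Synthetic_Data[, list(nested_DT = list(data.table(X, Y))), by = "id"]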

I've tried this, but it doesn't work:

plan(multisession,workers=6)
Synthetic_Data[,(Names):=future_map(nested_DT,lm_summary),.SDcols=Names]

It gives this error: Supplied 4 columns to be assigned 10000 items. Please see NEWS for v1.12.2

However, this works perfectly fine:

Synthetic_Data[,Model:=future_map(nested_DT,lm_summary)]

But instead of a single Model list column, I need the four Names columns appended to the data.table.

The error occurs because map or lapply returns a list of nrow elements, each of length 4, whereas := with four column names expects a list of 4 elements, each of length nrow.
transpose fixes this and seems quite efficient, even without futures (data.table has built-in multithreading for some of its internal operations):

Synthetic_Data[,(Names):=transpose(lapply(nested_DT,lm_summary))][]

Key: <id>
          id            nested_DT     Estimate  Std.Error     t-value    Pr(>|t|)
       <int>               <list>       <list>     <list>      <list>      <list>
    1:     1 <data.table[1000x2]>  -0.01190821 0.03114259  -0.3823769   0.7022632
    2:     2 <data.table[1000x2]>  -0.04105424  0.0302131   -1.358823   0.1745098
    3:     3 <data.table[1000x2]>   0.01960603 0.03129079   0.6265752    0.531081
    4:     4 <data.table[1000x2]>   0.02806479 0.03394502   0.8267719    0.408564
    5:     5 <data.table[1000x2]>  -0.08444368 0.03177666   -2.657412 0.008000118
   ---                                                                           
 9996:  9996 <data.table[1000x2]>  0.005208541 0.03169238   0.1643468   0.8694914
 9997:  9997 <data.table[1000x2]>  -0.02861342 0.03276352  -0.8733318   0.3826924
 9998:  9998 <data.table[1000x2]> -0.002026795 0.03287628 -0.06164917   0.9508546
 9999:  9999 <data.table[1000x2]>   -0.0118748 0.03031627  -0.3916973   0.6953655
10000: 10000 <data.table[1000x2]>   0.02973648 0.02981824   0.9972579   0.3188811
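
To make the shape mismatch concrete, here is a minimal sketch on a toy list. The values are made up for illustration and are not lm() output, and DT, per_row and per_col are hypothetical names. It calls data.table::transpose explicitly, on the assumption that another attached package (for example purrr) might otherwise mask transpose.

```r
library(data.table)

# Toy stand-in for the per-group summaries: 3 groups, 4 statistics each.
# (Illustrative values only, not real regression results.)
Names <- c("Estimate", "Std.Error", "t-value", "Pr(>|t|)")
per_row <- list(
  c(0.1, 0.01, 10, 0.001),
  c(0.2, 0.02, 20, 0.002),
  c(0.3, 0.03, 30, 0.003)
)

# lapply/map shape: one element per row (3), each of length 4.
length(per_row)

# transpose flips it: one element per column (4), each of length 3.
per_col <- data.table::transpose(per_row)
length(per_col)

# Four names now line up with four column vectors, so := accepts it.
DT <- data.table(id = 1:3)
DT[, (Names) := per_col]
print(DT)
```

With data.table::transpose the new columns come out as plain numeric vectors rather than list columns.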

I have a solution, but it is inelegant:

Synthetic_Dat<-cbind(Synthetic_Data,future_map_dfr(Synthetic_Data$nested_DT,lm_summary) %>% setDT(.))
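
If the parallel map is still wanted, one possible alternative to cbind is to wrap future_map in data.table::transpose and assign by reference. This is only a sketch on a much smaller synthetic table; the name small, the group sizes and the worker count are assumptions chosen so it runs quickly.

```r
library(data.table)
library(future)
library(furrr)

Names <- c("Estimate", "Std.Error", "t-value", "Pr(>|t|)")
lm_summary <- function(Data) coef(summary(lm(Y ~ ., data = Data)))["X", ]

# Small stand-in for Synthetic_Data: 20 groups of 50 rows (assumed sizes).
set.seed(1)
small <- data.table(id = rep(1:20, each = 50), X = rnorm(1000), Y = rnorm(1000))
small <- small[, list(nested_DT = list(data.table(X, Y))), by = "id"]

plan(multisession, workers = 2)  # worker count is an arbitrary choice here

# future_map returns one length-4 vector per row; transpose turns that into
# four length-nrow vectors, which is the shape := expects for four names.
small[, (Names) := data.table::transpose(future_map(nested_DT, lm_summary))]

plan(sequential)                 # shut the background workers down again
print(head(small))
```

Whether this beats the plain lapply version depends on how the cost of shipping each nested table to a worker compares with the cost of the fit itself, so it is worth timing both on the real data.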
