简体   繁体   中英

Merging data.tables uses more than 10 GB RAM

I have two data.tables: DT and meta . When I merge them using DT[meta] , memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT itself was created by merging two data.tables without any problems.

Edit:

It seems to be a problem with the key. I can do the following without a problem:

DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)

DT2<-DT[meta]   # Comment from Matthew Dowle:
                # X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it? 

unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
     Moor_ID
[1,]    TRUE

First data.table:

str(DT)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  55 variables:
 $ DWD_ID                 : chr  "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
 $ numdays                : int  1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
 $ days                   : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
 $ TBoden_dayAnzahl       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TBoden_dayMin          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMax          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanAR       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayStabw        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMedian       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanMM       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ T2m_dayAnzahl          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T2m_dayMin             : num  15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
 $ T2m_dayMax             : num  25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
 $ T2m_dayMeanAR          : num  19 16.9 15.6 15.2 14.8 ...
 $ T2m_dayStabw           : num  3.409 2.048 1.334 1.726 0.965 ...
 $ T2m_dayMedian          : num  17.2 16.8 15.2 14.8 14.5 ...
 $ T2m_dayMeanMM          : num  20.6 16.9 15.9 15.4 15.2 ...
 $ T10cm_dayAnzahl        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T10cm_dayMin           : num  14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
 $ T10cm_dayMax           : num  27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
 $ T10cm_dayMeanAR        : num  18.7 16.5 14.9 15.1 14.5 ...
 $ T10cm_dayStabw         : num  4.36 2.84 1.73 2.36 1.54 ...
 $ T10cm_dayMedian        : num  16.1 15.6 14.3 14.2 14 ...
 $ T10cm_dayMeanMM        : num  21 16.8 15.8 15.4 15.1 ...
 $ RF_dayAnzahl           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RF_dayMin              : num  45 58 73 56 68 62 63 44 65 58 ...
 $ RF_dayMax              : num  94 94 94 93 94 92 84 84 89 84 ...
 $ RF_dayMean             : num  68.6 76.3 78.9 74.4 86.5 ...
 $ RF_dayStabw            : num  17.09 12.53 5.88 9.83 5.62 ...
 $ RF_dayMedian           : num  64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
 $ Luftdruck_dayMean      : num  100.8 101 99.7 99.9 101.1 ...
 $ es_day                 : num  2.53 1.95 1.82 1.78 1.74 ...
 $ ea_day                 : num  1.57 1.42 1.49 1.27 1.38 ...
 $ defi_day               : num  0.956 0.535 0.327 0.509 0.355 ...
 $ Nebel_dayAnteil        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Sonnenscheind_dayAnzahl: int  18 18 18 18 18 18 18 18 18 18 ...
 $ Sonnenscheind_daySum   : num  6.63 4.93 1.05 5.82 3.27 ...
 $ julian_day             : int  244 245 246 247 248 249 250 251 252 253 ...
 $ zeta_day               : num  2.81 2.82 2.84 2.86 2.88 ...
 $ maxSonnenscheind       : num  13.9 13.8 13.7 13.6 13.5 ...
 $ R0_day                 : num  2920 2890 2860 2830 2799 ...
 $ Globalstrahlung_dayMean: num  NA NA NA NA NA NA NA NA NA NA ...
 $ RG_day                 : num  13.24 11.19 6.64 12.02 9.03 ...
 $ lambdaET_day           : num  2.45 2.46 2.46 2.46 2.47 ...
 $ sAnstieg_day           : num  0.15 0.122 0.116 0.113 0.111 ...
 $ gamma_day              : num  0.067 0.0669 0.0659 0.0661 0.0668 ...
 $ ETp_TW_day             : num  2.71 2.15 1.28 2.24 1.68 ...
 $ Moor_ID                : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Distanz_in_km          : num  24 24 24 24 24 ...
 $ North                  : num  53.5 53.5 53.5 53.5 53.5 ...
 $ East                   : num  8.58 8.58 8.58 8.58 8.58 ...
 $ Hoehe_in_m             : num  7 7 7 7 7 7 7 7 7 7 ...
 $ Kueste_km              : num  20 20 20 20 20 ...
 $ peatland               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ diffmaxt2m             : num  -1.6 0 -0.2 0.1 -0.4 ...
 - attr(*, "sorted")= chr "Moor_ID"
 - attr(*, ".internal.selfref")=<externalptr> 

Second data.table:

str(meta)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  6 variables:
 $ Moor_ID        : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Hoehe_Moor     : num  2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
 $ Kueste_km      : num  15.7 15.7 15.7 15.7 15.7 ...
 $ WSPsommer_muGOK: num  0.699 0.699 0.699 0.699 0.699 ...
 $ WSPwinter_muGOK: num  0.446 0.446 0.446 0.446 0.446 ...
 $ Moorgroesse_km2: num  59 59 59 59 59 59 59 59 59 59 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "Moor_ID"

Session info:

R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] grDevices datasets  splines   graphics  stats     tcltk     utils     methods   base     

other attached packages:
[1] reshape_0.8.4    plyr_1.7.1       data.table_1.8.0 svSocket_0.9-53  TinnR_1.0-5      R2HTML_2.2       Hmisc_3.9-3     
[8] survival_2.36-14

loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1    lattice_0.20-6 svMisc_0.9-65  tools_2.15.1 

My bad. The problem was that keys were not unique:

a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
     x y y.1
[1,] 1 1   3
[2,] 1 2   3
[3,] 1 1   4
[4,] 1 2   4

It would be nice if data.table could give a warning for that.


Update from Matthew

This warning has now been implemented in v1.8.7 :

New argument allow.cartesian ( default FALSE ) added to X[Y] and merge(X,Y) , #2464. Prevents large allocations due to misspecified joins; eg, duplicate key values in Y joining to the same group in X over and over again. The word cartesian is used loosely for when more than max(nrow(X),nrow(Y)) rows would be returned. The error message is verbose and includes advice.

也许其他函数可以更好地工作,例如merge()cbind()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM