
R performance with data reshaping

I am trying to reshape a data frame in R and am running into problems with the recommended ways of doing so. The data frame has the following structure:

ID                     DATE1             DATE2            VALTYPE        VALUE
'abcd1233'         2009-11-12        2009-12-23           'TYPE1'        123.45
...

VALTYPE is a string-valued factor with only two levels (say TYPE1 and TYPE2). I need to transform the data into the following "wide" form, matching rows on common ID and DATE values:

ID                     DATE1             DATE2            VALUE.TYPE1  VALUE.TYPE2
'abcd1233'             2009-11-12        2009-12-23       123.45           NA
...

The data frame has more than 4,500,000 observations (although about 70% of the VALUEs are NA). The machine is an Intel-based Linux workstation with 4 GB of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250 MB, which clearly leaves a lot of room for reshaping.

These are my experiences so far:

  • Using vanilla reshape() method:

    tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE")

RESULT: Error: cannot allocate vector of size 4.8 Gb

  • Using cast() method of reshape package:

    tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE)

RESULT: R process consumes all RAM with no end in sight. Had to kill the process eventually.

  • Using by() and merge() :

    sp <- by(tbl[c(1, 2, 3, 5)], tbl$VALTYPE, function(x) x)
    tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE)

RESULT: works fine, although this is not very elegant or foolproof (i.e. it will break if more types are added).

To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?

A useful trick is to combine the id variables into a character vector and then do the reshape.

tbl$NEWID <- with(tbl, paste(ID, DATE1, DATE2, sep=";"))
tbl2 <- recast(tbl, NEWID ~ VALTYPE, measure.var = "VALUE")

This is about 40% faster on a problem of similar size on my Intel Core 2 Duo 2.2 GHz MacBook.
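For reference, the same trick also works with base reshape() alone on a toy data frame (my own minimal example; splitting NEWID back into its components afterwards is an extra step the answer leaves implicit):

```r
# Toy version of the question's data
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233"),
  DATE1   = "2009-11-12",
  DATE2   = "2009-12-23",
  VALTYPE = c("TYPE1", "TYPE2"),
  VALUE   = c(123.45, 67.89),
  stringsAsFactors = FALSE
)

# Collapse the three id variables into a single character key
tbl$NEWID <- with(tbl, paste(ID, DATE1, DATE2, sep = ";"))

# Reshape on the single key instead of the three-column id
tbl2 <- reshape(tbl[c("NEWID", "VALTYPE", "VALUE")],
                direction = "wide", idvar = "NEWID", timevar = "VALTYPE")

# Split the combined key back into its components
parts <- do.call(rbind, strsplit(tbl2$NEWID, ";", fixed = TRUE))
tbl2$ID    <- parts[, 1]
tbl2$DATE1 <- parts[, 2]
tbl2$DATE2 <- parts[, 3]
```

The result has one row per key with columns VALUE.TYPE1 and VALUE.TYPE2, and the original id columns recovered from NEWID.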

What about doing this in a non-R-like manner? I assume you have a TYPE1 and a TYPE2 row for each value of ID,DATE1,DATE2? Then sort the dataframe by those variables, and write a big for loop. You can repeatedly do rbind() operations to build the table, or you could try to pre-allocate the table (maybe) and just assign the VALUE.TYPE1 and VALUE.TYPE2 slots with [<-, which should do the assignment in-place.

(Note that if you're using rbind(), I believe that it's inefficient if you have any factor variables, so make sure everything is a character instead!)
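A minimal sketch of the sort-free preallocation variant of this idea, assuming exactly the TYPE1/TYPE2 column names from the question; using the combined key as rownames for row lookup is my own device, not something the answer specifies:

```r
# Toy input (same shape as the question's data)
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233"),
  DATE1   = "2009-11-12",
  DATE2   = "2009-12-23",
  VALTYPE = c("TYPE1", "TYPE2"),
  VALUE   = c(123.45, 67.89),
  stringsAsFactors = FALSE
)

# Pre-allocate one output row per unique (ID, DATE1, DATE2) key
keys <- unique(tbl[c("ID", "DATE1", "DATE2")])
out  <- data.frame(keys, VALUE.TYPE1 = NA_real_, VALUE.TYPE2 = NA_real_)
rownames(out) <- with(keys, paste(ID, DATE1, DATE2, sep = ";"))

# Assign each VALUE into its slot in place with [<-
rowkey <- with(tbl, paste(ID, DATE1, DATE2, sep = ";"))
for (i in seq_len(nrow(tbl))) {
  col <- if (tbl$VALTYPE[i] == "TYPE1") "VALUE.TYPE1" else "VALUE.TYPE2"
  out[rowkey[i], col] <- tbl$VALUE[i]
}
```

Because the output is allocated once up front, no rbind() copies are made inside the loop; missing types simply stay NA.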

Maybe you can use the cat() function?
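Presumably this means streaming the wide rows out with cat() instead of building the whole result in memory, much like the AWK approach mentioned in the question. One possible reading (my own sketch; the tab-separated format and tempfile() output path are assumptions):

```r
# Toy input
tbl <- data.frame(
  ID      = c("abcd1233", "abcd1233"),
  DATE1   = "2009-11-12",
  DATE2   = "2009-12-23",
  VALTYPE = c("TYPE1", "TYPE2"),
  VALUE   = c(123.45, 67.89),
  stringsAsFactors = FALSE
)

out <- tempfile()
con <- file(out, "w")
cat("ID\tDATE1\tDATE2\tVALUE.TYPE1\tVALUE.TYPE2\n", file = con)

# Write one wide line per (ID, DATE1, DATE2) group; match() yields NA
# for a missing type, so absent values come out as NA in the file
for (grp in split(tbl, with(tbl, paste(ID, DATE1, DATE2, sep = ";")))) {
  v1 <- grp$VALUE[match("TYPE1", grp$VALTYPE)]
  v2 <- grp$VALUE[match("TYPE2", grp$VALTYPE)]
  cat(grp$ID[1], grp$DATE1[1], grp$DATE2[1], v1, v2, sep = "\t", file = con)
  cat("\n", file = con)
}
close(con)

# Read the wide table back if it is needed in R at all
wide <- read.delim(out)
```

Only one group's rows are held in memory at a time, so peak usage stays far below that of reshape() or cast().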
