
How to copy grouped rows into column by data.table in R?

I ran into a memory error when copying rows into columns with the gather/unite/spread technique from the tidyverse, as described in an earlier question (How to copy grouped rows into column by dplyr/tidyverse in R?).

This is the data frame I am using as an example (sorry, most of this question replicates the previous one):

df <- data.frame(
    hid=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
    mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4),
    tmid=c("010","01010","010","01020",
           "010","0120","010","010","020",
           "010","01010","010","01020"),
    thid=c("010","02020","010","02020",
           "000","0120","010","010","010",
           "010","02020","010","02020")
    )

My desired output is shown below:

     hid   mid  tmid   thid  tmid_1  tmid_2  tmid_3  tmid_4  tmid_5  thid_1  thid_2  thid_3  thid_4  thid_5
 * <dbl> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> 
 1     1     1   010    010    010  01010    010  01020      0    010  02020    010  02020      0
 2     1     2 01010  02020    010  01010    010  01020      0    010  02020    010  02020      0
 3     1     3   010    010    010  01010    010  01020      0    010  02020    010  02020      0
 4     1     4 01020  02020    010  01010    010  01020      0    010  02020    010  02020      0
 5     2     1   010    000    010  0120     010    010    020    000   0120    010    010    010
 6     2     2  0120   0120    010  0120     010    010    020    000   0120    010    010    010
 7     2     3   010    010    010  0120     010    010    020    000   0120    010    010    010
 8     2     4   010    010    010  0120     010    010    020    000   0120    010    010    010
 9     2     5   020    010    010  0120     010    010    020    000   0120    010    010    010
10     3     1   010    010    010  01010    010  01020      0    010  02020    010   02020     0
11     3     2 01010  02020    010  01010    010  01020      0    010  02020    010   02020     0
12     3     3   010    010    010  01010    010  01020      0    010  02020    010   02020     0
13     3     4 01020  02020    010  01010    010  01020      0    010  02020    010   02020     0

An image of this operation is shown below: [image]

What I am trying to do in this operation:

  • Convert thid and tmid into columns.
  • The numeric suffix in thid_x and tmid_x is defined by mid; however, the maximum value of mid is not fixed (it ranges from 1 up to 8-10 in the actual large data set).
  • mid is grouped by hid, which defines how many mids are stored in each hid.
  • If a value does not exist, it should be padded with 0 (i.e., some hid have 5 mids but some have only 4, so tmid_5 should be 0 for such hid).

However, when I do this with the gather/unite/spread technique from the previous question, I get a memory error saying that R cannot allocate 11.4 GB of memory.

Perhaps the reason for this error is that the gather function needs to create all the combinations specified in its arguments before splitting them up. The actual data frame has around 80,000 records, and the operation exceeds the 16 GB of RAM available to my 64-bit version of R.

Do you have any suggestions for getting the same result without creating such huge intermediate records? Perhaps data.table would help if it does not require such an intermediate step; however, I have always used dplyr and have never used that package. I would like to hear your ideas for getting past this issue, and I am willing to learn a new package if the analysis requires it.
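Since the question asks about data.table, here is a minimal sketch of one possible approach, not a benchmarked solution. It assumes every (hid, mid) pair occurs at most once and that tmid and thid are stored as character rather than factor. dcast() builds one wide row per hid directly, and fill = "0" pads the missing mid values. The names dt, wide and res are made up for illustration.

```r
library(data.table)

# Same example data as in the question, kept as character columns
dt <- data.table(
  hid  = c(1,1,1,1,2,2,2,2,2,3,3,3,3),
  mid  = c(1,2,3,4,1,2,3,4,5,1,2,3,4),
  tmid = c("010","01010","010","01020",
           "010","0120","010","010","020",
           "010","01010","010","01020"),
  thid = c("010","02020","010","02020",
           "000","0120","010","010","010",
           "010","02020","010","02020")
)

# One row per hid, one column per (variable, mid) pair: tmid_1 ... thid_5.
# Pairs that do not exist (e.g. mid 5 for hid 1) are filled with "0".
wide <- dcast(dt, hid ~ mid, value.var = c("tmid", "thid"), fill = "0")

# Attach the wide columns to every original row of the same hid
res <- merge(dt, wide, by = "hid")
```

The wide table has only one row per hid, so the join adds columns without multiplying rows. Whether this stays within memory on the real data would need to be checked there.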

I think you can use a combination of spread and left_join to get what you need:

library(dplyr)
library(tidyr)

# Spread tmid into one column per mid (mid_1, mid_2, ...), then rename to tmid_*
a <- select(df, -thid) %>%
  spread(mid, tmid, sep="_") %>%
  rename_at(vars(matches("^mid_")), ~ paste0("t", .))
# Same for thid, renaming mid_* to thid_*
b <- select(df, -tmid) %>%
  spread(mid, thid, sep="_") %>%
  rename_at(vars(matches("^mid_")), ~ gsub("^m", "th", .))

left_join(df, a, by="hid") %>%
  left_join(b, by="hid")
#    hid mid  tmid  thid tmid_1 tmid_2 tmid_3 tmid_4 tmid_5 thid_1 thid_2 thid_3 thid_4 thid_5
# 1    1   1   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 2    1   2 01010 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 3    1   3   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 4    1   4 01020 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 5    2   1   010   000    010   0120    010    010    020    000   0120    010    010    010
# 6    2   2  0120  0120    010   0120    010    010    020    000   0120    010    010    010
# 7    2   3   010   010    010   0120    010    010    020    000   0120    010    010    010
# 8    2   4   010   010    010   0120    010    010    020    000   0120    010    010    010
# 9    2   5   020   010    010   0120    010    010    020    000   0120    010    010    010
# 10   3   1   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 11   3   2 01010 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 12   3   3   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 13   3   4 01020 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>

Cleaning up the NA values should be easy enough, but it may require you to re-factor the columns (add a "0" level) or simply create the data frame with stringsAsFactors=FALSE.
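A minimal base R sketch of that cleanup, shown on a small made-up frame (out is hypothetical and stands in for the joined result). It converts the spread columns to character, which sidesteps the factor-level issue, and then replaces NA with "0". The pattern "^t[mh]id_" is an assumption based on the column names above.

```r
# Toy stand-in for the joined result, with factor columns containing NA
out <- data.frame(
  hid    = c(1, 2),
  tmid_5 = factor(c(NA, "020")),
  thid_5 = factor(c(NA, "010"))
)

# Pick the spread columns (tmid_* and thid_*) by name
wide_cols <- grep("^t[mh]id_", names(out), value = TRUE)

out[wide_cols] <- lapply(out[wide_cols], function(x) {
  x <- as.character(x)   # drop factor levels so "0" can be assigned
  x[is.na(x)] <- "0"     # pad missing values with "0"
  x
})
```

The padded columns end up as character rather than factor; wrap them in factor() afterwards if factors are needed.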
