I faced memory error to copy rows into columns using gather/unite/spread
technique in dplyr
in R
in another question here ( How to copy grouped rows into column by dplyr/tidyverse in R? ).
It is the data frame that I am using as an example: (Sorry, most of this question just replicates previous question)
df <- data.frame(
hid=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4),
tmid=c("010","01010","010","01020",
"010","0120","010","010","020",
"010","01010","010","01020"),
thid=c("010","02020","010","02020",
"000","0120","010","010","010",
"010","02020","010","02020")
)
My desired output is show below:
hid mid tmid thid tmid_1 tmid_2 tmid_3 tmid_4 tmid_5 thid_1 thid_2 thid_3 thid_4 thid_5
* <dbl> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr>
1 1 1 010 010 010 01010 010 01020 0 010 02020 010 02020 0
2 1 2 01010 02020 010 01010 010 01020 0 010 02020 010 02020 0
3 1 3 010 010 010 01010 010 01020 0 010 02020 010 02020 0
4 1 4 01020 02020 010 01010 010 01020 0 010 02020 010 02020 0
5 2 1 010 000 010 0120 010 010 020 000 0120 010 010 010
6 2 2 0120 0120 010 0120 010 010 020 000 0120 010 010 010
7 2 3 010 010 010 0120 010 010 020 000 0120 010 010 010
8 2 4 010 010 010 0120 010 010 020 000 0120 010 010 010
9 2 5 020 010 010 0120 010 010 020 000 0120 010 010 010
10 3 1 010 010 010 01010 010 01020 0 010 02020 010 02020 0
11 3 2 01010 02020 010 01010 010 01020 0 010 02020 010 02020 0
12 3 3 010 010 010 01010 010 01020 0 010 02020 010 02020 0
13 3 4 01020 02020 010 01010 010 01020 0 010 02020 010 02020 0
An image of this operation is shown below:
What I am trying to do in this operation are:
thid
and tmid
into column thid_x
and tmid_x
is defined by mid
; however, maximum number of mid
is not scalable (it spreads from 1 to 8-10 in actual large data set) mid
is grouped by hid
to define how many mid
s are stored in each hid
0
(ie, some hid
have 5 mid
s but some have only 4, thus tmid_5 should be 0 for such hid
) However, when I do this operation using gather/unite/spread
technique in the previous question, it encounters an memory error saying cannot allocate 11.4GB of memory.
Perhaps the reason of this error is that gather
function needs to create all combinations specified in its argument before splitting them up. Actual data frame has around 80,000 records which exceeds 16GB RAM in my 64-bit version of R
.
Do you have any suggestions to get the same outcome without making such huge intermediate records? Perhaps data.table
may help if it does not require such intermediate operation, however I used to use dplyr
and never used that package. I would like to have your idea to beyond this issue and would learn new package based on the need of analyses for further steps.
I think you can use a combination of spread
and left_join
to get what you need:
library(dplyr)
library(tidyr)
a <- select(df, -thid) %>%
spread(mid, tmid, sep="_") %>%
rename_at(vars(matches("^mid_")), funs(paste0("t", .)))
b <- select(df, -tmid) %>%
spread(mid, thid, sep="_") %>%
rename_at(vars(matches("^mid_")), funs(gsub("^m", "th", .)))
left_join(df, a, by="hid") %>%
left_join(b, by="hid")
# hid mid tmid thid tmid_1 tmid_2 tmid_3 tmid_4 tmid_5 thid_1 thid_2 thid_3 thid_4 thid_5
# 1 1 1 010 010 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 2 1 2 01010 02020 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 3 1 3 010 010 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 4 1 4 01020 02020 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 5 2 1 010 000 010 0120 010 010 020 000 0120 010 010 010
# 6 2 2 0120 0120 010 0120 010 010 020 000 0120 010 010 010
# 7 2 3 010 010 010 0120 010 010 020 000 0120 010 010 010
# 8 2 4 010 010 010 0120 010 010 020 000 0120 010 010 010
# 9 2 5 020 010 010 0120 010 010 020 000 0120 010 010 010
# 10 3 1 010 010 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 11 3 2 01010 02020 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 12 3 3 010 010 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
# 13 3 4 01020 02020 010 01010 010 01020 <NA> 010 02020 010 02020 <NA>
Cleaning up the NA
values should be easy enough, but may require you to re-factor them (add a level of "0"
) or just create the frame using stringsAsFactors=FALSE
.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.