
R - How to add rows for missing values for unique group sequences?

My problem is similar to this previous question: "Fastest way to add rows for missing values in a data.frame?"

I can't figure out how to add rows padded with NA when the min/max period differs by group.

> red<-data.frame(project = c(6, 6, 6, 6, 6, 9, 9, 9), period =c(1, 2, 5:7, 2, 4, 5), v3=letters[1:8], v4=c("red", "yellow", recursive = T))
> red
  project period v3     v4
1       6      1  a    red
2       6      2  b yellow 
3       6      5  c    red
4       6      6  d yellow
5       6      7  e    red
6       9      2  f yellow
7       9      4  g    red
8       9      5  h yellow

I want it to look like:

project period v3     v4
      6      1  a    red
      6      2  b yellow
      6      3 NA     NA
      6      4 NA     NA
      6      5  c    red
      6      6  d yellow
      6      7  e    red
      9      2  f yellow
      9      3 NA     NA
      9      4  g    red
      9      5  h yellow

When I used

library(data.table)
DT=as.data.table(red)
setkey(DT, project, period)

DT[CJ(unique(project), seq(min(period), max(period)))]

it gave every project all 7 periods (1-7). Project 6 should have periods 1-7, but project 9 should only have periods 2-5.
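As a quick check of why this happens, here is a minimal sketch that builds the same CJ() grid from plain vectors (the names `project`, `period`, and `grid` are only for illustration). The min and max are taken over all periods at once, so the cross join pairs every project with the full 1-7 range.

```r
library(data.table)

# Same values as the project and period columns of the example data
project <- c(6, 6, 6, 6, 6, 9, 9, 9)
period  <- c(1, 2, 5:7, 2, 4, 5)

# min()/max() see every period at once, so the sequence is always 1..7
grid <- CJ(unique(project), seq(min(period), max(period)))

# 2 projects x 7 periods: every project is paired with periods 1-7
nrow(grid)
```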

I've tried fiddling with .SD[ which.max(period)], by=project] but no cigar.

I thought it would be something simple inside seq(), but I tried seq(min(period, by=project)) with no luck.

Thank you!

DT[setkey(DT[, .(min(period):max(period)), by = project], project, V1)]
#    project period v3     v4
# 1:       6      1  a    red
# 2:       6      2  b yellow
# 3:       6      3 NA     NA
# 4:       6      4 NA     NA
# 5:       6      5  c    red
# 6:       6      6  d yellow
# 7:       6      7  e    red
# 8:       9      2  f yellow
# 9:       9      3 NA     NA
#10:       9      4  g    red
#11:       9      5  h yellow
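The same idea can be sketched in two steps with an explicit `on =` join, which does not depend on the table being keyed. This is only an illustration, not the answerer's exact code: the names `grid` and `res` are made up here, and `stringsAsFactors = FALSE` is added so that `v3` and `v4` stay character on older R versions.

```r
library(data.table)

red <- data.frame(
  project = c(6, 6, 6, 6, 6, 9, 9, 9),
  period  = c(1, 2, 5:7, 2, 4, 5),
  v3      = letters[1:8],
  v4      = rep(c("red", "yellow"), 4),
  stringsAsFactors = FALSE
)
DT <- as.data.table(red)

# One row per (project, period), spanning each project's own min..max.
# Because of by = project, min() and max() are evaluated within each group.
grid <- DT[, .(period = seq(min(period), max(period))), by = project]

# Join DT onto the grid: every grid row is kept, and grid rows with no
# match in DT get NA in v3 and v4.
res <- DT[grid, on = .(project, period)]
res
```

Naming the sequence column `period` in the first step lets the join use the same column names on both sides, so there is no need to refer to the auto-generated `V1`.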

I don't know if this is the idiomatic way or not, but I was able to get your desired output by first creating a group index and then subsetting the correct rows out of .SD per that index:

DT[, indx := .GRP, project][, 
     .SD[CJ(unique(project), seq(min(period), max(period)))], indx]

#     indx project period v3     v4
#  1:    1       6      1  a    red
#  2:    1       6      2  b yellow
#  3:    1       6      3 NA     NA
#  4:    1       6      4 NA     NA
#  5:    1       6      5  c    red
#  6:    1       6      6  d yellow
#  7:    1       6      7  e    red
#  8:    2       9      2  f yellow
#  9:    2       9      3 NA     NA
# 10:    2       9      4  g    red
# 11:    2       9      5  h yellow

The accepted answer does not work (anymore?), but it is close.

setkey(DT,project,period)
DT[setkey(DT[, .(min(period):max(period)), by = project], project, V1)]

Notes:

1. You need to wrap the period sequence in a list, .(), for this to work.
2. @MiamiCG, I am guessing you needed to allow cartesian because you did not key the table first. If you set allow.cartesian to TRUE, there will be no error message, but the result will not be correct.

Update: @eddi has updated his answer to match mine, so it now works.

