
Reshaping data.table with cumulative sum

I want to reshape a data.table and include the historic (cumulatively summed) information for each variable. The No variable indicates the chronological order of measurements for each object ID. At each measurement, additional information becomes available. I want to aggregate the information known at each timestamp No for each object ID.

Let me demonstrate with an example:

For the following data.table:

library(data.table)

df <- data.table(ID=c(1,1,1,2,2,2,2),
                 No=c(1,2,3,1,2,3,4), 
                 Variable=c('a','b', 'a', 'c', 'a', 'a', 'b'),
                 Value=c(2,1,3,3,2,1,5))
df
   ID No Variable Value
1:  1  1        a     2
2:  1  2        b     1
3:  1  3        a     3
4:  2  1        c     3
5:  2  2        a     2
6:  2  3        a     1
7:  2  4        b     5

I want to reshape it to this:

       ID No  a  b  c
    1:  1  1  2 NA NA
    2:  1  2  2  1 NA
    3:  1  3  5  1 NA
    4:  2  1 NA NA  3
    5:  2  2  2 NA  3
    6:  2  3  3 NA  3
    7:  2  4  3  5  3

That is, the values of Value summed per Variable, by (ID, No), cumulatively over No.
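
To spell out what I mean by cumulative, it is the running total of Value within each (ID, Variable) group, for example (on a copy, so df itself is unchanged):

copy(df)[, CumValue := cumsum(Value), by = .(ID, Variable)][]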

I can get the result without the cumulative part by doing

dcast(df, ID+No~Variable, value.var="Value")

which results in the non-cumulative variant:

   ID No  a  b  c
1:  1  1  2 NA NA
2:  1  2 NA  1 NA
3:  1  3  3 NA NA
4:  2  1 NA NA  3
5:  2  2  2 NA NA
6:  2  3  1 NA NA
7:  2  4 NA  5 NA

Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.

EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important; there are also Value values of 0, which mean something different from NA.

POSSIBLE SOLUTION

Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.

The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. The dcast function can then be used as before to produce the reshaped table. In code:

df[, TotalNo := .N, by = ID]                            # number of measurements per ID
df2 <- df[rep(seq(nrow(df)), df$TotalNo - df$No + 1)]   # create duplicates
df3 <- df2[order(ID, No)]
df3[, No := seq(from = No[1], to = TotalNo[1], by = 1), by = .(ID, No)]  # spread duplicates over No..TotalNo
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)

It is not really nice, because the creation of duplicates uses more memory. I think it can be optimized further, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows; in the original file, from 241,670 rows to a whopping 978,331. That's more than a factor of 4 larger.
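
A quick sanity check of the blow-up on the sample data (using the df and df3 objects created above):

nrow(df)   # 7 rows originally
nrow(df3)  # 16 rows after duplication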

SOLUTION: Eddi's answer improves on my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!

Your solution is good, but you're adding too many rows; they're unnecessary if you compute the cumsum beforehand:

# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]

# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
         on = c('ID', 'Variable', 'No'), roll = TRUE],
      ID + No ~ Variable, value.var = 'CumValue')
#   ID No  a  b  c
#1:  1  1  2 NA NA
#2:  1  2  2  1 NA
#3:  1  3  5  1 NA
#4:  2  1 NA NA  3
#5:  2  2  2 NA  3
#6:  2  3  3 NA  3
#7:  2  4  3  5  3
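
To unpack the one-liner: the inner call builds a complete grid of No values for every (ID, Variable) pair, and roll = TRUE then carries the last observed CumValue forward onto that grid before casting. The same code, just split into named intermediates:

# complete grid of No values for every (ID, Variable) pair
grid <- df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)]

# rolling join: for each grid row, take the last df row at or before that No
filled <- df[grid, on = c('ID', 'Variable', 'No'), roll = TRUE]

dcast(filled, ID + No ~ Variable, value.var = 'CumValue')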

Here's a standard way:

library(zoo)

df[, cv := cumsum(Value), by = .(ID, Variable)]
DT   = dcast(df, ID + No ~ Variable, value.var="cv")

lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
DT


   ID No  a  b  c
1:  1  1  2 NA NA
2:  1  2  2  1 NA
3:  1  3  5  1 NA
4:  2  1 NA NA  3
5:  2  2  2 NA  3
6:  2  3  3 NA  3
7:  2  4  3  5  3
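
If you'd rather avoid the zoo dependency, the same forward fill can be done with data.table's own nafill() (available in newer data.table versions, roughly 1.12.4+); a sketch under that assumption:

df[, cv := cumsum(Value), by = .(ID, Variable)]
DT <- dcast(df, ID + No ~ Variable, value.var = "cv")

lvls <- sort(unique(df$Variable))
# carry the last non-NA value forward within each ID (leading NAs stay NA)
DT[, (lvls) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = lvls]
DT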

An alternative is to use a custom-built cumulative sum function. This is essentially the method from @David Arenburg's comment, but with a custom cumulative-summary function substituted in.

EDIT: Using @eddi's much more efficient custom cumulative sum function.

# cumulative sum that stays NA over leading NAs and treats later NAs as 0
cumsum.na <- function(z){
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE), z, accumulate = TRUE)
}

cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[
  , (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res

   ID No  a  b  c
1:  1  1  2 NA NA
2:  1  2  2  1 NA
3:  1  3  5  1 NA
4:  2  1 NA NA  3
5:  2  2  2 NA  3
6:  2  3  3 NA  3
7:  2  4  3  5  3

This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative summary function that handles NAs the way you want.
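
To see the NA handling on its own, apply the function to a small vector: leading NAs stay NA, and later NAs simply carry the running total forward.

cumsum.na(c(NA, 2, NA, 3))
# [1] NA  2  2  5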
