
Perform calculations in R on grouped rows and add result to existing column

I want to perform calculations on grouped rows of a data frame in R. My usual approach would be to spread the column and compute on the resulting columns, but I would also like to be able to do this without reshaping the data frame. For example, I want to compute a fold change for varA and varB for each subject, dividing the 'post' timepoint value by the 'pre' timepoint value, so that the data frame df below ends up looking like df_foldchange. The result should appear as new rows whose value in the existing 'timepoint' column is 'foldchange'.

df <- data.frame(subject = c('subject1', 'subject1', 'subject2', 'subject2'),
                 varA = c(1, 2, 1, 3),
                 varB = c(2, 3, 2, 4),
                 timepoint = c('pre', 'post', 'pre', 'post'))

df_foldchange <- data.frame(subject = c('subject1', 'subject1', 'subject1',
                             'subject2', 'subject2', 'subject2'),
                 varA = c(1, 2, 2, 1, 3, 3),
                 varB = c(2, 3, 1.5, 2, 4, 2),
                 timepoint = c('pre', 'post', 'foldchange', 
                               'pre', 'post', 'foldchange'))

I suspect you've mixed up your 'pre' / 'post' sequence in the construction of df? The way you have it, you don't have a 'post' for 'subject1', or a 'pre' for 'subject2'.
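
Since the division only makes sense when every subject has both timepoints, a quick check before computing anything may help. This is a minimal base R sketch, assuming exactly one 'pre' and one 'post' row is expected per subject; the names counts and complete are made up for this illustration.

```r
df <- data.frame(subject = c('subject1', 'subject1', 'subject2', 'subject2'),
                 varA = c(1, 2, 1, 3),
                 varB = c(2, 3, 2, 4),
                 timepoint = c('pre', 'post', 'pre', 'post'),
                 stringsAsFactors = FALSE)

# Cross-tabulate subjects against the two expected timepoints.
# Fixing the factor levels keeps a column for a timepoint even if it never occurs.
counts <- table(df$subject, factor(df$timepoint, levels = c('pre', 'post')))
print(counts)

# TRUE only if every subject has exactly one 'pre' and one 'post' row
complete <- all(counts == 1)
print(complete)
```

If complete is FALSE, the table shows which subject is missing a timepoint or has it duplicated.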

You could do:

df <- data.frame(subject = c('subject1', 'subject1', 'subject2', 'subject2'),
                 varA = c(1, 2, 1, 3),
                 varB = c(2, 3, 2, 4),
                 timepoint = c('pre', 'post', 'pre', 'post'),
                 stringsAsFactors = FALSE)

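library(dplyr)

# One 'foldchange' row per subject: 'post' value divided by 'pre' value
df1 <- df %>%
  group_by(subject) %>%
  summarise(varA = varA[timepoint == 'post'] / varA[timepoint == 'pre'],
            varB = varB[timepoint == 'post'] / varB[timepoint == 'pre'],
            timepoint = 'foldchange')

# Append the new rows and keep each subject's rows together
df_foldchange <- df %>%
  bind_rows(df1) %>%
  arrange(subject)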

# output
   subject varA varB  timepoint
1 subject1    1  2.0        pre
2 subject1    2  3.0       post
3 subject1    2  1.5 foldchange
4 subject2    1  2.0        pre
5 subject2    3  4.0       post
6 subject2    3  2.0 foldchange

You could sort the above to get exactly the output you want, if the order is important.
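
If the row order does matter, one way to make it explicit is to sort by subject and then by a fixed timepoint sequence. The sketch below uses base R only, so it should work whichever approach produced the data frame; the rows are deliberately shuffled here to show the effect, and timepoint_order and sorted are names made up for this illustration.

```r
# The desired result from the question, with its rows deliberately shuffled
df_foldchange <- data.frame(
  subject = c('subject2', 'subject1', 'subject1', 'subject2', 'subject1', 'subject2'),
  varA = c(3, 2, 1, 3, 2, 1),
  varB = c(2, 1.5, 2, 4, 3, 2),
  timepoint = c('foldchange', 'foldchange', 'pre', 'post', 'post', 'pre'),
  stringsAsFactors = FALSE)

# Hypothetical helper vector: the order in which timepoints should appear
timepoint_order <- c('pre', 'post', 'foldchange')

# Sort by subject first, then by each timepoint's position in timepoint_order
sorted <- df_foldchange[order(df_foldchange$subject,
                              match(df_foldchange$timepoint, timepoint_order)), ]
rownames(sorted) <- NULL  # drop the shuffled row names
print(sorted)
```

Using match() against an explicit vector avoids relying on alphabetical order, which would otherwise put 'foldchange' before 'post' and 'pre'.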

Using data.table you could do the following:

df <- data.frame(subject = c('subject1', 'subject1', 'subject2', 'subject2'),
                 varA = c(1, 2, 1, 3),
                 varB = c(2, 3, 2, 4),
                 timepoint = c('pre', 'post', 'pre', 'post'))

library(data.table)
setDT(df)  # convert the data frame to a data.table by reference

# Per subject, divide the 'post' value by the 'pre' value in each selected column
df2 <- df[, lapply(.SD, function(x) x[timepoint == "post"] / x[timepoint == "pre"]),
          by = subject, .SDcols = varA:varB]
df2[, timepoint := "foldchange"]  # label the new rows
df_foldchange <- rbind(df, df2)   # append the fold-change rows
df_foldchange[order(subject)]     # print with each subject's rows together

#output
    subject varA varB  timepoint
1: subject1    1  2.0        pre
2: subject1    2  3.0       post
3: subject1    2  1.5 foldchange
4: subject2    1  2.0        pre
5: subject2    3  4.0       post
6: subject2    3  2.0 foldchange
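
If you would rather avoid extra packages, the same per-subject division can probably be done in base R as well. This is not taken from the answers above, just one possible sketch; it assumes each subject has exactly one 'pre' and one 'post' row, and value_cols, fold_change_rows and out are names made up for this illustration.

```r
df <- data.frame(subject = c('subject1', 'subject1', 'subject2', 'subject2'),
                 varA = c(1, 2, 1, 3),
                 varB = c(2, 3, 2, 4),
                 timepoint = c('pre', 'post', 'pre', 'post'),
                 stringsAsFactors = FALSE)

value_cols <- c('varA', 'varB')  # hypothetical name: the columns to divide

# Build one 'foldchange' row per subject
# (assumes exactly one 'pre' and one 'post' row in each group)
fold_change_rows <- lapply(split(df, df$subject), function(g) {
  post <- g[g$timepoint == 'post', value_cols]
  pre  <- g[g$timepoint == 'pre',  value_cols]
  data.frame(subject = g$subject[1], post / pre, timepoint = 'foldchange',
             stringsAsFactors = FALSE)
})

# Append the new rows, then group each subject's rows together.
# order() leaves ties in their original order, so 'foldchange' stays last.
out <- rbind(df, do.call(rbind, fold_change_rows))
out <- out[order(out$subject), ]
rownames(out) <- NULL
print(out)
```

Listing the columns in value_cols means further measurement columns can be added without repeating the division for each one.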
