简体   繁体   中英

R- Subtracting the mean of a group from each element of that group in a dataframe

I am trying to merge a vector 'means' to a dataframe. My dataframe looks like this Data = growth

I first calculated all the means for the different groups (1 group = population + temperature + size + replicat) using this command:

means<-aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement, data=growth, list=growth$Name, mean)        

Then, I selected the means for Measurement 1 as follows as I am only interested in these means.

meansT0<-means[which(means$Measurement=="1"),]    

Now, I would like to merge this vector of means values to my dataframe (=growth) so that the right mean of each group corresponds to the right part of the dataframe.

The goal is to then substrat the mean of each group (at Measurement 1) to each element of the dataframe based on its belonging group (and for all other Measurements except Measurement 1). Maybe there is no need to add the means column to the dataframe? Do you know any command to do that ?

[27.06.18] I made up this simplified dataframe, I hope this help understanding. So, what I want is to substrat, for each individual in the dataframe and for each measurement (here only Measurement 1 and Measurement 2, normally I have more), the mean of its belongig group at MEASUREMENT 1.

So, if I get the means by group ( 1 group = Population + Temperature + Measurement):

means<-aggregate(TL ~ Population + Temperature + Measurement, data=growth, list=growth$Name, mean)
means               

I got these values of means (in this example) :

Population Temperature Measurement       TL
JUB          15           **1**           **12.00000**
JUB          20           **1**           **15.66667**
JUB          15           2           17.66667
JUB          20           2           18.66667
JUB          15           3           23.66667
JUB          20           3           24.33333

We are only interested by the means at MEASUREMENT 1. For each individual in the dataframe, I want to substrat the mean of its belonging group at Measurement 1: in this example (see dataframe with R command): - for the group JUB+15+Measurement 1 , mean = 12 - for the group JUB+20+Measurement 1 , mean = 15.66

growth<-data.frame(Population=c("JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB"), Measurement=c("1","1","1","1","1","1","2","2","2","2","2","2", "3", "3", "3", "3", "3", "3"),Temperature=c("15","15","15","20", "20", "20","15","15","15","20", "20", "20","15","15","15","20", "20", "20"),TL=c(11,12,13,15,18,14, 16,17,20,21,19,16, 25,22,24,26,24,23), New_TL=c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66", "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66", "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66"))    
print(growth)

I hope with this, you can understand better what I am trying to do. I have a lot of data and if I have to do this manually, this will take me a lot of time and increase the risk of me putting mistakes.

Here is an option with tidyverse . After grouping by the group columns, use mutate_at specifying the columns of interest and get the difference of that column ( . ) with the mean of it.

library(tidyverse)
growth %>% 
       group_by(Population, Temperature, Replicat, Size, Measurement) %>% 
       mutate_at(vars(HL, TL), funs(MeanGroupDiff = . 
                  - mean(.[Measurement == 1])))

Using a reproducible example with mtcars dataset

data(mtcars)
mtcars %>%
   group_by(cyl, vs) %>% 
   mutate_at(vars(mpg, disp), funs(MeanGroupDiff = .- mean(.[am==1])))

Have you considered using the data.table package? It is very well suited for doing these kind of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.

The code below shows how a workflow similar to the one you described but based on the built in mtcars data set might look using data.table .

To be clear, there are also ways to do what you describe using base R as well as other packages like dplyr , just throwing out a suggestion based on what I have found the most useful for my personal work.

library(data.table)

## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl,am, gear)]

## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]

## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[,group_mpg_avg := mean(mpg), keyby = .(gear, am)]

## Calculate each row's difference from the group mean
DT[,mpg_diff_from_group := mpg - group_mpg_avg]

print(DT)

#     mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4   6  0    3         19.75                1.65
# 2: 18.1   6  0    3         19.75               -1.65
# 3: 19.2   6  0    4         18.50                0.70
# 4: 17.8   6  0    4         18.50               -0.70
# 5: 21.0   6  1    4         21.00                0.00
# 6: 21.0   6  1    4         21.00                0.00
# 7: 19.7   6  1    5         19.70                0.00

Consider by to subset your data frame by factors (but leave out Measurement in order to compare group 1 and all other groups). Then, run an ifelse conditional logic calculation for needed columns. Since by will return a list of data frames, bind all outside with do.call() :

df_list <- by(growth, growth[,c("Population", "Temperature")], function(sub) {
  # TL CORRECTION      
  sub$Correct_TL <- ifelse(sub$Measurement != 1, 
                           sub$TL - mean(subset(sub, Measurement == 1)$TL),
                           sub$TL)
  # ADD OTHER CORRECTIONS

  return(sub)  
})

final_df <- do.call(rbind, df_list)

Output (using posted data)

final_df

#    Population Measurement Temperature TL   New_TL Correct_TL
# 1         JUB           1          15 11    11-12 11.0000000
# 2         JUB           1          15 12    12-12 12.0000000
# 3         JUB           1          15 13    13-12 13.0000000
# 7         JUB           2          15 16    16-12  4.0000000
# 8         JUB           2          15 17    17-12  5.0000000
# 9         JUB           2          15 20    20-12  8.0000000
# 13        JUB           3          15 25    25-12 13.0000000
# 14        JUB           3          15 22    22-12 10.0000000
# 15        JUB           3          15 24    24-12 12.0000000
# 4         JUB           1          20 15 15-15.66 15.0000000
# 5         JUB           1          20 18 18-15.66 18.0000000
# 6         JUB           1          20 14 14-15.66 14.0000000
# 10        JUB           2          20 21 21-15.66  5.3333333
# 11        JUB           2          20 19 19-15.66  3.3333333
# 12        JUB           2          20 16 16-15.66  0.3333333
# 16        JUB           3          20 26 26-15.66 10.3333333
# 17        JUB           3          20 24 24-15.66  8.3333333
# 18        JUB           3          20 23 23-15.66  7.3333333

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM