
R: Subtracting the mean of a group from each element of that group in a data frame

I am trying to merge a vector of means ('means') into a data frame. My data frame is called growth.

I first calculated the means for all the different groups (1 group = Population + Temperature + Size + Replicat) using this command:

means <- aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement, data = growth, FUN = mean)

Then I selected the means for Measurement 1, as these are the only means I am interested in:

meansT0<-means[which(means$Measurement=="1"),]    

Now I would like to merge these mean values into my data frame (growth) so that each group's mean lines up with the rows that belong to that group.

The goal is then to subtract the mean of each group (at Measurement 1) from each element of the data frame, according to the group it belongs to (for all Measurements other than Measurement 1). Maybe there is no need to add the means column to the data frame? Do you know a command that does this?

[27.06.18] I made up this simplified data frame; I hope it helps. What I want is to subtract, for each individual in the data frame and for each measurement (here only Measurements 1 to 3; normally I have more), the mean of its group at MEASUREMENT 1.

So, if I compute the means by group (1 group = Population + Temperature + Measurement):

means <- aggregate(TL ~ Population + Temperature + Measurement, data = growth, FUN = mean)
means

I got these mean values (in this example):

Population Temperature Measurement       TL
JUB                 15           1 12.00000
JUB                 20           1 15.66667
JUB                 15           2 17.66667
JUB                 20           2 18.66667
JUB                 15           3 23.66667
JUB                 20           3 24.33333

We are only interested in the means at MEASUREMENT 1. For each individual in the data frame, I want to subtract the mean of its group at Measurement 1. In this example (see the data frame built with the R command below):

- for the group JUB + 15 + Measurement 1, mean = 12
- for the group JUB + 20 + Measurement 1, mean ≈ 15.67 (written as 15.66 in the New_TL column)

growth <- data.frame(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL          = c(11, 12, 13, 15, 18, 14,
                  16, 17, 20, 21, 19, 16,
                  25, 22, 24, 26, 24, 23),
  # New_TL spells out the desired subtraction: TL minus the group's Measurement 1 mean
  New_TL      = c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66",
                  "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66",
                  "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66")
)
print(growth)

I hope this makes it clearer what I am trying to do. I have a lot of data, and doing this manually would take a lot of time and increase the risk of making mistakes.

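Here is an option with tidyverse. After grouping by the group columns, use mutate_at to specify the columns of interest and take the difference between each column ( . ) and its mean at Measurement 1. Measurement itself must not be one of the grouping columns; otherwise the groups for the other measurements contain no Measurement 1 rows and the mean comes out as NaN.

library(tidyverse)
growth %>% 
       # Measurement is left out of the grouping so each group keeps its Measurement 1 rows
       group_by(Population, Temperature, Replicat, Size) %>% 
       # funs() is deprecated since dplyr 0.8; a named list of formulas replaces it
       mutate_at(vars(HL, TL), list(MeanGroupDiff = ~ . 
                  - mean(.[Measurement == 1])))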

Using a reproducible example with the mtcars dataset:

data(mtcars)
mtcars %>%
   group_by(cyl, vs) %>% 
   # funs() is deprecated since dplyr 0.8; a named list of formulas replaces it
   mutate_at(vars(mpg, disp), list(MeanGroupDiff = ~ . - mean(.[am == 1])))
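
For reference, the same idea applied to the simplified growth data from the question might look like the sketch below. It assumes that a group is Population + Temperature only (as in the simplified example) and that Measurement is stored as character. The column names baseline_TL and TL_corrected are made up for illustration, and plain mutate() is used instead of mutate_at() because only one column is involved.

```r
library(dplyr)

# Toy data from the question (TL values copied from the post)
growth <- data.frame(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL          = c(11, 12, 13, 15, 18, 14,
                  16, 17, 20, 21, 19, 16,
                  25, 22, 24, 26, 24, 23)
)

result <- growth %>%
  # Measurement is deliberately NOT a grouping column, so every group
  # still contains the Measurement 1 rows needed for the baseline mean
  group_by(Population, Temperature) %>%
  mutate(
    baseline_TL  = mean(TL[Measurement == "1"]),  # hypothetical column name
    TL_corrected = TL - baseline_TL               # hypothetical column name
  ) %>%
  ungroup()

print(result)
```

This subtracts the baseline from every row, including the Measurement 1 rows, which is what the New_TL column in the question spells out (11-12, 12-12, ...). If the Measurement 1 rows should stay untouched, they would need to be handled separately.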

Have you considered using the data.table package? It is very well suited to the kind of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.

The code below shows how a workflow similar to the one you described, but based on the built-in mtcars data set, might look using data.table.

To be clear, there are also ways to do what you describe using base R or other packages such as dplyr; I am just offering a suggestion based on what I have found most useful in my own work.

library(data.table)

## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl,am, gear)]

## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]

## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[,group_mpg_avg := mean(mpg), keyby = .(gear, am)]

## Calculate each row's difference from the group mean
DT[,mpg_diff_from_group := mpg - group_mpg_avg]

print(DT)

#     mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4   6  0    3         19.75                1.65
# 2: 18.1   6  0    3         19.75               -1.65
# 3: 19.2   6  0    4         18.50                0.70
# 4: 17.8   6  0    4         18.50               -0.70
# 5: 21.0   6  1    4         21.00                0.00
# 6: 21.0   6  1    4         21.00                0.00
# 7: 19.7   6  1    5         19.70                0.00
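
The mtcars example above subtracts each group's own mean. To subtract the Measurement 1 mean instead, as the question asks, one possible data.table sketch is an update join against a small baseline table. It assumes the simplified growth data from the question, with groups defined by Population + Temperature; the names growth_dt, baseline, baseline_TL and TL_corrected are illustrative only.

```r
library(data.table)

# Toy data from the question (TL values copied from the post)
growth_dt <- data.table(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL          = c(11, 12, 13, 15, 18, 14,
                  16, 17, 20, 21, 19, 16,
                  25, 22, 24, 26, 24, 23)
)

## One row per group: mean TL over the Measurement 1 rows only
baseline <- growth_dt[Measurement == "1",
                      .(baseline_TL = mean(TL)),
                      by = .(Population, Temperature)]

## Update join: copy each group's baseline onto the matching rows by reference
growth_dt[baseline, on = .(Population, Temperature),
          baseline_TL := i.baseline_TL]

## Difference from the Measurement 1 baseline
growth_dt[, TL_corrected := TL - baseline_TL]

print(growth_dt)
```

The update join is essentially the "merge the means back" step from the question, done in place and without reordering the rows.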

Consider by to split your data frame by factors (but leave out Measurement, so that Measurement 1 can be compared with all other measurements within each group). Then run an ifelse conditional calculation for the columns you need. Since by returns a list of data frames, bind them all together afterwards with do.call():

df_list <- by(growth, growth[,c("Population", "Temperature")], function(sub) {
  # TL CORRECTION      
  sub$Correct_TL <- ifelse(sub$Measurement != 1, 
                           sub$TL - mean(subset(sub, Measurement == 1)$TL),
                           sub$TL)
  # ADD OTHER CORRECTIONS

  return(sub)  
})

final_df <- do.call(rbind, df_list)

Output (using posted data)

final_df

#    Population Measurement Temperature TL   New_TL Correct_TL
# 1         JUB           1          15 11    11-12 11.0000000
# 2         JUB           1          15 12    12-12 12.0000000
# 3         JUB           1          15 13    13-12 13.0000000
# 7         JUB           2          15 16    16-12  4.0000000
# 8         JUB           2          15 17    17-12  5.0000000
# 9         JUB           2          15 20    20-12  8.0000000
# 13        JUB           3          15 25    25-12 13.0000000
# 14        JUB           3          15 22    22-12 10.0000000
# 15        JUB           3          15 24    24-12 12.0000000
# 4         JUB           1          20 15 15-15.66 15.0000000
# 5         JUB           1          20 18 18-15.66 18.0000000
# 6         JUB           1          20 14 14-15.66 14.0000000
# 10        JUB           2          20 21 21-15.66  5.3333333
# 11        JUB           2          20 19 19-15.66  3.3333333
# 12        JUB           2          20 16 16-15.66  0.3333333
# 16        JUB           3          20 26 26-15.66 10.3333333
# 17        JUB           3          20 24 24-15.66  8.3333333
# 18        JUB           3          20 23 23-15.66  7.3333333
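
The question originally asked about merging the means back into the data frame. A rough base R sketch of that route, plus an ave() variant that avoids the extra merge, could look like this. It assumes the simplified growth data with groups defined by Population + Temperature; baseline, baseline_TL, merged and TL_corrected are names invented for the example. Unlike the by() version above, both variants also subtract the baseline from the Measurement 1 rows.

```r
# Toy data from the question (TL values copied from the post)
growth <- data.frame(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL          = c(11, 12, 13, 15, 18, 14,
                  16, 17, 20, 21, 19, 16,
                  25, 22, 24, 26, 24, 23)
)

# Step 1: mean TL per Population/Temperature group, Measurement 1 rows only
baseline <- aggregate(TL ~ Population + Temperature,
                      data = growth[growth$Measurement == "1", ],
                      FUN = mean)
names(baseline)[names(baseline) == "TL"] <- "baseline_TL"  # hypothetical name

# Step 2: merge the baseline back so every row carries its group's mean
# (note: merge() may reorder the rows)
merged <- merge(growth, baseline, by = c("Population", "Temperature"))
merged$TL_corrected <- merged$TL - merged$baseline_TL

# Alternative without a merge: ave() returns the group mean aligned with
# the original rows; rows from other measurements are masked with NA first
baseline_vec <- ave(ifelse(growth$Measurement == "1", growth$TL, NA),
                    growth$Population, growth$Temperature,
                    FUN = function(x) mean(x, na.rm = TRUE))
growth$TL_corrected <- growth$TL - baseline_vec

print(merged)
print(growth)
```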
