
Average between duplicated rows in R

I have a data frame df with rows that are duplicated in the name column but not in the value column:

name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y

I need to aggregate the duplicate names into one row, while calculating the mean over the value column. The expected output is as follows:

name    value   etc1    etc2
A       10      1       X
B       2       1       Y
C       45      1       Y

I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute character content. Since all the other columns have the same content across the "duplicates", I need them to be aggregated as is, just like the name column. Any hints...?

Here is a data.table solution. It is general in the sense that it works even for a data.frame with 60 columns, since I group the data by every variable other than value (see how I create keys below).

library(data.table)
dat <- read.table(text='name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
  name etc1 etc2 mm
1:    A    1    X 10
2:    B    1    Y  2
3:    C    1    Y 45

EDIT: extending to more than one value variable

In case you have more than one numeric variable for which you want to compute the mean, for example if your data look like this:

  name value etc1 etc2     value1
1    A     9    1    X  2.1763485
2    A    10    1    X -0.7954326
3    A    11    1    X -0.5839844
4    B     2    1    Y -0.5188709
5    C    40    1    Y -0.8300233
6    C    50    1    Y -0.7787496

The above solution can be extended like this:

X[,lapply(.SD,mean),keys]
   name etc1 etc2 value     value1
1:    A    1    X    10  0.2656438
2:    B    1    Y     2 -0.5188709
3:    C    1    Y    45 -0.8043865

This computes the mean of every variable that is not in the keys list.
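If some of the non-key columns are not numeric, lapply(.SD, mean) would be applied to them as well. A minimal sketch of one way to guard against that, assuming the data shown above and a hypothetical value_cols vector that names the columns to average explicitly (value_cols and res are illustrative names, not part of the original answer):

```r
library(data.table)

# Same data as in the example above, built directly as a data.table
X <- data.table(
  name   = c("A", "A", "A", "B", "C", "C"),
  value  = c(9, 10, 11, 2, 40, 50),
  etc1   = 1L,
  etc2   = c("X", "X", "X", "Y", "Y", "Y"),
  value1 = c(2.1763485, -0.7954326, -0.5839844,
             -0.5188709, -0.8300233, -0.7787496)
)

# Hypothetical: list the columns to average instead of pattern matching
value_cols <- c("value", "value1")
keys <- setdiff(names(X), value_cols)

# .SDcols restricts .SD to the chosen columns; everything else is a group key
res <- X[, lapply(.SD, mean), by = keys, .SDcols = value_cols]
print(res)
```

Listing the value columns by name avoids relying on every such column containing the string 'value', which is what the grepl() call above assumes.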

You can use the aggregate() function as follows:

aggregate(df$value, by=list(name=df$name, etc1=df$etc1, etc2=df$etc2), FUN=mean)
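With a bare vector as the first argument, the averaged column comes back named x rather than value. A minimal sketch of a variant that keeps the original column name, assuming the question's df and a hypothetical group_cols vector holding every column except value:

```r
# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = c(1, 1, 1, 1, 1, 1),
  etc2  = c("X", "X", "X", "Y", "Y", "Y")
)

# Hypothetical helper: group by every column other than 'value'
group_cols <- setdiff(names(df), "value")

# Passing one-column data frames (not vectors) preserves the column names
res <- aggregate(df["value"], by = df[group_cols], FUN = mean)
print(res)
```

Because group_cols is computed from names(df), the call does not have to be edited when more constant columns are added.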

The code (written by Metrics) almost works, except in one place (.name). I slightly modified it:

sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

library(plyr)  # ddply() and summarize() come from plyr
sample.m <- ddply(sample, 'name', summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))

sample.m
      name value etc1 etc2
    1    A    10    1    X
    2    B     2    1    Y
    3    C    45    1    Y

Assuming your data frame is df.

install.packages("plyr")
library(plyr)



df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))

df.m
 name value etc1 etc2
1    A      10    1    X
2    B       2    1    Y
3    C      45    1    Y
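If plyr is not available, the same idea (mean of value, first entry of everything else) can be sketched in base R. This is only an illustration on the question's data; the names averaged and res are hypothetical:

```r
# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = c(1, 1, 1, 1, 1, 1),
  etc2  = c("X", "X", "X", "Y", "Y", "Y")
)

# Work on a copy so df itself is left untouched
averaged <- df

# ave() returns the group mean repeated for every row of its group
averaged$value <- ave(as.numeric(df$value), df$name, FUN = mean)

# Keep only the first row of each name, like head(etc1, 1) does per group
res <- averaged[!duplicated(averaged$name), ]
print(res)
```

This relies on the other columns being constant within each name, as stated in the question.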

This simple one worked for me:

avg_data <- aggregate(. ~ name, df, mean)

This uses the formula method of aggregate(): for all variables (.), grouped by the naming variable (name) within the data.frame df, apply the mean function.
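Note that with . ~ name the mean is also taken over etc1 and etc2, and as far as I know mean() only returns NA with a warning for a non-numeric column such as etc2. A minimal sketch of a variant for the question's data, assuming the constant columns are moved to the right-hand side so that the dot covers only value:

```r
# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = c(1, 1, 1, 1, 1, 1),
  etc2  = c("X", "X", "X", "Y", "Y", "Y")
)

# Columns named on the right-hand side are grouping keys;
# the dot on the left stands for all remaining columns (here only 'value')
avg_data <- aggregate(. ~ name + etc1 + etc2, data = df, FUN = mean)
print(avg_data)
```

The grouping columns are carried through unchanged, so character content is never passed to mean().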
