How to aggregate two different columns with two different functions in R dataframe
I have a data frame in which some records are duplicated, and I need to aggregate the duplicates so that there is one unique record per row.
An example:
Col1 Col2 Col3 Col4
A 0.170 83 0.878
B 0.939 103 0.869
C 0.228 80 0.935
D 0.566 169 0.851
D 0.566 137 0.588
E 0.703 103 0.636
I need the average of Col4 weighted by Col3, and the sum of Col3. So my result would be:
Col1 Col2 Col3 Col4
A 0.17 83 0.878
B 0.939 103 0.869
C 0.228 80 0.935
D 0.566 306 0.733
E 0.703 103 0.636
Usually I would use the aggregate function, but I can't find a way to apply two different functions to different columns. Is there another way I can accomplish this? I am effectively ignoring Col2: before merging with the data that brought in Col3 and Col4, the granularity was one record per row, and now it is being duplicated.
Thank you!!
Using dplyr, you can use group_by to get one row per unique value of "Col1" and then pass all your different functions into summarise. With your example, it can be something like the code below.
NB: To calculate the weighted.mean of Col4 by Col3, you need to call it before calculating the sum of Col3. summarise evaluates its arguments in order, so once Col3 has been overwritten by its sum it has length 1 within each group, and the lengths of Col4 and Col3 will differ.
You can then put the columns of your dataframe back in the original order using select:
library(dplyr)
df %>%
  group_by(Col1) %>%
  summarise(Col2 = mean(Col2),
            Col4 = weighted.mean(Col4, Col3),  # must run before Col3 is overwritten
            Col3 = sum(Col3)) %>%
  select(Col1, Col2, Col3, Col4)               # restore the original column order
# A tibble: 5 x 4
Col1 Col2 Col3 Col4
<chr> <dbl> <int> <dbl>
1 A 0.17 83 0.878
2 B 0.939 103 0.869
3 C 0.228 80 0.935
4 D 0.566 306 0.733
5 E 0.703 103 0.636
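As a sanity check on the D row, the weighted mean can be recomputed by hand in base R. This is only a small sketch using the two D rows from the example; the variable names col3, col4, col3_sum and col4_wmean are made up for illustration.

```r
# The two duplicated rows for group "D" from the example data
col3 <- c(169, 137)      # weights (Col3)
col4 <- c(0.851, 0.588)  # values to average (Col4)

col3_sum   <- sum(col3)                     # 306
col4_wmean <- sum(col3 * col4) / sum(col3)  # 224.375 / 306

# weighted.mean() from base R computes the same quantity
stopifnot(isTRUE(all.equal(col4_wmean, weighted.mean(col4, col3))))

round(col4_wmean, 3)  # 0.733
```

Both numbers match the D row of the output above (Col3 = 306, Col4 = 0.733).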
Data:
# dput() output; the .internal.selfref pointer was dropped because it is not valid R syntax
df <- structure(list(Col1 = c("A", "B", "C", "D", "D", "E"), Col2 = c(0.17,
0.939, 0.228, 0.566, 0.566, 0.703), Col3 = c(83L, 103L, 80L,
169L, 137L, 103L), Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588,
0.636)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))
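Base R solution:
aggregated_df <- do.call("rbind", lapply(split(df, df$Col1), function(x) {
  # return a one-row data.frame (not a list) so that rbind yields atomic columns
  data.frame(Col1 = unique(x$Col1), Col2 = mean(x$Col2), Col3 = sum(x$Col3),
             Col4 = weighted.mean(x$Col4, x$Col3), stringsAsFactors = FALSE)
}))
rownames(aggregated_df) <- NULL  # drop the group names that split() leaves as row names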
Data:
df <-
structure(
list(
Col1 = c("A", "B", "C", "D", "D", "E"),
Col2 = c(0.17,
0.939, 0.228, 0.566, 0.566, 0.703),
Col3 = c(83L, 103L, 80L,
169L, 137L, 103L),
Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588,
0.636)
),
row.names = c(NA,-6L),
class = c("data.frame"
))
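Since the question started from aggregate, here is a possible sketch that stays with it. It relies on the identity weighted mean = sum(weight * value) / sum(weight), so a single FUN = sum is enough. The helper column wx and the result name agg are made up for illustration, and Col2 is assumed to be constant within each Col1 group, as in the example.

```r
df <- data.frame(
  Col1 = c("A", "B", "C", "D", "D", "E"),
  Col2 = c(0.17, 0.939, 0.228, 0.566, 0.566, 0.703),
  Col3 = c(83L, 103L, 80L, 169L, 137L, 103L),
  Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 0.636),
  stringsAsFactors = FALSE
)

# Helper column (hypothetical name): weight times value
df$wx <- df$Col3 * df$Col4

# Sum both the weights and the weighted values within each group
agg <- aggregate(cbind(Col3, wx) ~ Col1 + Col2, data = df, FUN = sum)

# Weighted mean = sum(weight * value) / sum(weight)
agg$Col4 <- agg$wx / agg$Col3
agg$wx <- NULL

# aggregate() orders rows by the grouping columns; sort by Col1 again
agg <- agg[order(agg$Col1), ]
rownames(agg) <- NULL
agg
```

Because cbind builds a numeric matrix, Col3 comes back as a double rather than an integer; wrap it in as.integer if the type matters.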