How to aggregate two different columns with two different functions in R dataframe

I have a data frame in which some records are duplicated, and I need to aggregate the duplicates so that there is one unique record per row.

An example:

Col1    Col2    Col3    Col4
A       0.170   83     0.878
B       0.939   103    0.869
C       0.228   80     0.935
D       0.566   169    0.851
D       0.566   137    0.588
E       0.703   103    0.636

I need to take the average of Col4 weighted by Col3, and sum Col3. So my result would be:

Col1    Col2    Col3    Col4
A      0.17     83     0.878
B      0.939    103    0.869
C      0.228    80     0.935
D      0.566    306    0.733
E      0.703    103    0.636

Usually I would use the aggregate function, but I can't find a way to apply two different functions to two different columns. Is there another way I can accomplish this? I am effectively ignoring Col2: before merging with the data that brought in Col3 and Col4, the granularity was one record per row, and now Col2 is being duplicated.

Thank you!!

Using dplyr, you can use group_by to group the rows by each unique value of "Col1" and then pass all your different functions into summarise. With your example, it could look something like this.

NB: To calculate the weighted.mean of Col4 by Col3, you need to call this function before calculating the sum of Col3. Otherwise Col3 has already been reduced to a single value per group, and the lengths of Col4 and Col3 will differ.

You can then put the columns of your data frame back in the original order using select:

library(dplyr)
df %>% group_by(Col1) %>%
  summarise(Col2 = mean(Col2),
            Col4 = weighted.mean(Col4,Col3),
            Col3 = sum(Col3)) %>%
  select(Col1,Col2,Col3,Col4)

# A tibble: 5 x 4
  Col1   Col2  Col3  Col4
  <chr> <dbl> <int> <dbl>
1 A     0.17     83 0.878
2 B     0.939   103 0.869
3 C     0.228    80 0.935
4 D     0.566   306 0.733
5 E     0.703   103 0.636
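
To illustrate the NB above about ordering, here is a minimal base R sketch that uses only the two "D" rows from the example. The variable names (col3, col4, wm, total, msg) are made up for this illustration and are not part of the answer's code. It shows the weighted mean computed with the per-row weights, and then what happens when the weights have already been collapsed to their sum.

```r
# Rows of group "D" from the example data
col3 <- c(169L, 137L)    # weights
col4 <- c(0.851, 0.588)  # values

# Correct order: weighted mean first, using the per-row weights
# (169 * 0.851 + 137 * 0.588) / (169 + 137)
wm <- weighted.mean(col4, col3)
round(wm, 3)  # 0.733

# Wrong order: once Col3 has been collapsed to its sum it has length 1,
# so it no longer lines up with the two values in col4
total <- sum(col3)
msg <- tryCatch(
  {
    weighted.mean(col4, total)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg  # an error message about mismatched lengths rather than NA
```

Inside summarise the same thing happens per group, because each new column is computed in sequence and a later expression sees the already-summarised Col3.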

Data

# dput() output, with the data.table class and its non-parseable
# .internal.selfref pointer removed so that the code can be run as-is
df <- structure(list(Col1 = c("A", "B", "C", "D", "D", "E"), Col2 = c(0.17, 
0.939, 0.228, 0.566, 0.566, 0.703), Col3 = c(83L, 103L, 80L, 
169L, 137L, 103L), Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 
0.636)), row.names = c(NA, -6L), class = "data.frame")

Base R solution:

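# Split by Col1, summarise each piece into a one-row data frame, then bind the rows.
# Returning a data.frame (rather than a list) from each piece keeps the result
# columns as ordinary atomic vectors instead of list columns.
aggregated_df <- do.call("rbind", lapply(split(df, df$Col1), function(x) {
  data.frame(Col1 = unique(x$Col1), Col2 = mean(x$Col2), Col3 = sum(x$Col3),
             Col4 = weighted.mean(x$Col4, x$Col3),
             stringsAsFactors = FALSE)
}))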

Data:

df <-
  structure(
    list(
      Col1 = c("A", "B", "C", "D", "D", "E"),
      Col2 = c(0.17,
               0.939, 0.228, 0.566, 0.566, 0.703),
      Col3 = c(83L, 103L, 80L,
               169L, 137L, 103L),
      Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588,
               0.636)
    ),
    row.names = c(NA,-6L),
    class = c("data.frame"
    ))
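
Since the question mentions aggregate, here is a hedged sketch of one way it might be used after all. It relies on the identity weighted mean = sum(w * x) / sum(w), so both columns only need sum. The helper column Col3xCol4 is a hypothetical name introduced for this illustration, and grouping by Col1 + Col2 assumes Col2 is constant within each Col1, as it is in the example data.

```r
df <- data.frame(
  Col1 = c("A", "B", "C", "D", "D", "E"),
  Col2 = c(0.17, 0.939, 0.228, 0.566, 0.566, 0.703),
  Col3 = c(83L, 103L, 80L, 169L, 137L, 103L),
  Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 0.636),
  stringsAsFactors = FALSE
)

# Helper column (hypothetical name): weight times value
df$Col3xCol4 <- df$Col3 * df$Col4

# Sum the weights and the weighted values within each group
agg <- aggregate(cbind(Col3, Col3xCol4) ~ Col1 + Col2, data = df, FUN = sum)

# Weighted mean = sum(w * x) / sum(w)
agg$Col4 <- agg$Col3xCol4 / agg$Col3
agg$Col3xCol4 <- NULL

# aggregate does not necessarily return the groups in Col1 order
agg <- agg[order(agg$Col1), ]
agg
```

The trade-off is an extra temporary column in exchange for using a single aggregation function throughout.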
