Applying group_by and summarise on data while keeping all the columns' info

I have a large dataset with 22000 rows and 25 columns. I am trying to group the dataset by one column and take the minimum of another column within each group. The problem is that the result contains only two columns, the grouping column and the column with the minimum value, but I need all the other columns for the rows that hold those minimum values. Here is a simple reproducible example:

    library(dplyr)

    data <- data.frame(a = 1:10,
                       b = c("a","a","a","b","b","c","c","d","d","d"),
                       c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
                       d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    # Only the grouping column and the summary column survive summarise()
    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d
    #   b min_values
    # 1 a        1.2
    # 2 b        1.7
    # 3 c        3.1
    # 4 d        2.2

So I also need the information in columns a and d. However, because column c contains duplicated values, I cannot merge the result back onto the original data using the min_values column. Is there a way to keep the other columns' information when using the dplyr package?

I found some explanations in "dplyr: group_by, subset and summarise" and "Finding percentage in a sub-group using group_by and summarise", but neither addresses my problem.

Here are two options using a) filter and b) slice from dplyr. In this case no group has duplicated minimum values in column c, so the results of a) and b) are the same. If there were duplicated minima, approach a) would return every row that holds the minimum in each group, while b) would return only one row (the first) per group.

a)

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Or similarly:

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

b)

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med
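
The difference between a) and b) only shows up when a group has tied minima. As a minimal sketch, the toy data frame ties below is made up for illustration (it is not the question's data); group "a" has two rows sharing the minimum of 1.2.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied at the minimum c = 1.2
ties <- data.frame(
  id = 1:5,
  b  = c("a", "a", "a", "b", "b"),
  c  = c(1.2, 1.2, 2.4, 1.7, 2.7)
)

# a) filter keeps every row whose c equals the group minimum
all_min <- ties %>%
  group_by(b) %>%
  filter(c == min(c)) %>%
  ungroup()

# b) slice + which.min keeps only the first minimum row per group
first_min <- ties %>%
  group_by(b) %>%
  slice(which.min(c)) %>%
  ungroup()

all_min$id    # 1 2 4
first_min$id  # 1 4
```

Which one is appropriate depends on whether tied rows should all be reported or exactly one row per group is wanted.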

You can use group_by without summarize. This keeps every row and adds each group's minimum as a new column:

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  ungroup()
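
Because mutate keeps every row, the new column can also be compared against c to mark which rows hold the group minimum. The following is only a sketch: it rebuilds the question's example data under the name dat, and the is_min column is a name introduced here for illustration.

```r
library(dplyr)

# The question's example data, rebuilt under the (assumed) name dat
dat <- data.frame(
  a = 1:10,
  b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
  c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
  d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
)

flagged <- dat %>%
  group_by(b) %>%
  mutate(
    min_values = min(c),         # group minimum repeated on every row
    is_min     = c == min_values # hypothetical flag: TRUE on the minimum row(s)
  ) %>%
  ungroup()

nrow(flagged)        # 10, no rows are dropped
sum(flagged$is_min)  # 4, one minimum row per group in this data
```

Filtering on is_min afterwards gives the same rows as the filter approach, while the unfiltered result still carries every original row.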

Using sqldf:

library(sqldf)
# Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output:

   a b   c     d
1  1 a 1.2 small
2  4 b 1.7  larg
3  6 c 3.1   med
4 10 d 2.2   med

With dplyr 1.1.0 or later, you can use .by in mutate, summarize, filter and slice to group temporarily, for that one call only. With mutate, all rows and columns are kept:

data %>% 
  mutate(min_values = min(c), .by = b)

With filter or slice, only the rows holding each group's minimum are returned, and all columns are kept:

# slice_min() takes by rather than .by
data %>% 
  slice_min(c, by = b)

# the comparison needs ==, not =
data %>% 
  filter(c == min(c), .by = b)
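
slice_min keeps tied rows by default (with_ties = TRUE), so it can return more than one row per group. As a hedged sketch, assuming dplyr 1.1.0 or later and using a made-up data frame ties (not the question's data), setting with_ties = FALSE limits the result to one row per group:

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the by argument of slice_min()

# Hypothetical toy data: group "a" has two rows tied at the minimum c = 1.2
ties <- data.frame(
  id = 1:5,
  b  = c("a", "a", "a", "b", "b"),
  c  = c(1.2, 1.2, 2.4, 1.7, 2.7)
)

# Default: every row tied for the minimum is returned
with_ties_kept <- slice_min(ties, c, by = b)

# with_ties = FALSE: a single row per group
one_row_per_group <- slice_min(ties, c, by = b, with_ties = FALSE)

nrow(with_ties_kept)     # 3
nrow(one_row_per_group)  # 2
```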
