
Accessing grouping variables in purrr::map() with nested dataframes

I'm using tidyr::nest() in combination with the purrr::map() family to split a data.frame into groups and then do some fancy stuff with each subset. Consider the following example, and please ignore the fact that I don't need nest() and map() to do this (it is an oversimplified example):

library(dplyr)
library(purrr)
library(tidyr)

mtcars %>% 
  group_by(cyl, gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data,~mean(.x$wt))
  )

# A tibble: 8 x 4
    cyl  gear data               cly2
  <dbl> <dbl> <list>            <dbl>
1     6     4 <tibble [4 x 9]>      6
2     4     4 <tibble [8 x 9]>      4
3     6     3 <tibble [2 x 9]>      6
4     8     3 <tibble [12 x 9]>     8
5     4     3 <tibble [1 x 9]>      4
6     4     5 <tibble [2 x 9]>      4
7     8     5 <tibble [2 x 9]>      8
8     6     5 <tibble [1 x 9]>      6

Usually when I do this type of operation, I need access to the grouping variable (cyl in this case) within map(). But these grouping variables appear as vectors whose length equals the number of rows in the nested data frame, so they are not easy to use inside map().

Is there a way I could run the following operation? I want the mean of wt divided by the number of cylinders (cyl) for each group (i.e. each row).

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data,~mean(.x$wt)/cyl)
  )


Error in mutate_impl(.data, dots) : 
  Evaluation error: Result 1 is not a length 1 atomic vector.

Take cyl out of the map call:

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data, ~mean(.x$wt)) / cyl
  )

# A tibble: 8 x 4
    cyl  gear data              wt_mean
  <dbl> <dbl> <list>              <dbl>
1     6     4 <tibble [4 x 9]>    0.516
2     4     4 <tibble [8 x 9]>    0.595
3     6     3 <tibble [2 x 9]>    0.556
4     8     3 <tibble [12 x 9]>   0.513
5     4     3 <tibble [1 x 9]>    0.616
6     4     5 <tibble [2 x 9]>    0.457
7     8     5 <tibble [2 x 9]>    0.421
8     6     5 <tibble [1 x 9]>    0.462

map_dbl sees cyl as a length-8 vector because nest() removes the grouping from the data.frame. Using cyl inside the map_* call (as in the OP's example) therefore makes each of the 8 iterations return a length-8 vector instead of a single number.
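To make the length mismatch concrete, here is a minimal sketch that measures what each iteration would return. It calls ungroup() explicitly so that cyl is the whole column regardless of whether the installed tidyr keeps the grouping after nest(); the names nested and result_lengths are only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tidyr)

# Nest by cyl and gear, then drop any grouping so cyl is the full column
nested <- mtcars %>%
  group_by(cyl, gear) %>%
  nest() %>%
  ungroup()

# For each nested tibble, record the length of mean(wt) / cyl.
# cyl here is the entire column, so every result has one element per row.
result_lengths <- map_int(nested$data, ~ length(mean(.x$wt) / nested$cyl))
result_lengths
```

Every entry of result_lengths equals the number of rows in the nested data frame, which is why map_dbl() refuses the result.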

Two other approaches:

Both give the same result as above but keep the grouping variables inside the map_* call, per the OP's specs:

Regrouping after nest():

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  group_by(cyl, gear) %>%
  mutate(wt_mean = map_dbl(data,~mean(.x$wt)/cyl))

map2 to iterate over cyl in parallel with data:

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(wt_mean = map2_dbl(data, cyl,~mean(.x$wt)/ .y))
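If more than one grouping variable is needed inside the function, the same idea can probably be extended with purrr::pmap_dbl(), which iterates over several columns in parallel. The sketch below is an illustration rather than part of the original answers: the column name wt_ratio and the formula dividing by cyl * gear are hypothetical, chosen only to show both variables being used.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tidyr)

res <- mtcars %>%
  group_by(cyl, gear) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # pmap_dbl() passes the i-th element of each list entry to the function,
    # so cyl and gear arrive as single values for each nested tibble
    wt_ratio = pmap_dbl(
      list(data, cyl, gear),
      function(data, cyl, gear) mean(data$wt) / (cyl * gear)  # hypothetical formula
    )
  )
res
```

The list is unnamed, so its elements are matched to the function's arguments by position.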

Since the release of dplyr 0.8.0, you can use group_map, which I find very handy for this use case. This is the example by GitHub user @yutannihilation:

library(dplyr, warn.conflicts = FALSE)

mtcars %>% 
  group_by(cyl) %>%
  group_map(function(data, group_info) {
    tibble::tibble(wt_mean = mean(data$wt) / group_info$cyl)
  })
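A note for later dplyr versions: as far as I know, from dplyr 0.8.1 onward group_map() returns a plain list, and the behaviour of combining the per-group results into one tibble moved to group_modify(). Assuming such a version is installed, a sketch of the same example would be:

```r
library(dplyr, warn.conflicts = FALSE)

res <- mtcars %>%
  group_by(cyl) %>%
  # data holds the rows of one group (without cyl);
  # group_info is a one-row tibble holding that group's cyl value
  group_modify(function(data, group_info) {
    tibble::tibble(wt_mean = mean(data$wt) / group_info$cyl)
  })
res
```

The result has one row per cyl group, with the grouping column kept alongside wt_mean.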
