Using summarise and across in the dplyr package while distinguishing between numeric and non-numeric columns

I would like to perform some operations using dplyr on a dataset that looks like this:

data <- data.frame(day = c(rep(1, 15), rep(2, 15)), nweek = rep(rep(1:5, 3),2), 
                   firm = rep(sapply(letters[1:3], function(x) rep(x, 5)), 2), 
                   quant = rnorm(30), price = runif(30) )

where each observation is at the day, week and firm level (there are only 2 days in a week).

I would like to summarise the data, grouping by firm and week, by (1) taking the average across the days of the week for the variables that are numeric (i.e., quant and price), and (2) taking the first entry for the variables that are not numeric. In this example the only non-numeric variable is firm, but in my real dataset I have several non-numeric variables (Date and character), and they may change within a week (nweek), so for all of them I would like to keep only the entry from the first day of the week.

I tried using summarise and across, but I get an error:

> data %>% group_by(firm, nweek) %>% dplyr::summarise(across(which(sapply(data, is.numeric)), ~ mean(.x, na.rm = TRUE)),
+                           across(which(sapply(data, !(is.numeric))), ~ head(.x, 1))
+ )
Error: Problem with `summarise()` input `..2`.
x invalid argument type
ℹ Input `..2` is `across(which(sapply(data, !(is.numeric))), ~head(.x, 1))`.
Run `rlang::last_error()` to see where the error occurred.

Any help?

I don't know what your expected output should look like, but something like this should get you close to what you are trying to achieve:

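library(dplyr)  # across() requires dplyr >= 1.0.0

data %>%
  group_by(firm, nweek) %>%
  summarise(
    # average every numeric column within each firm/week group
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
    # keep the first entry of every non-numeric column
    across(!where(is.numeric), ~ head(.x, 1))
  )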

As a side note, instead of using which(sapply(...)), have a look at the where helper for conditional selection of variables inside across in this post.
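
For what it's worth, the "invalid argument type" error in the question most likely comes from !(is.numeric): there, ! is applied to the function object itself rather than to its result. A minimal base R sketch of the difference, using a small made-up data frame df (not the question's data):

```r
# Small made-up data frame, for illustration only
df <- data.frame(firm = c("a", "b"), quant = c(1.5, 2.5))

# Negating the function object itself fails; capture the message instead of stopping
res <- tryCatch(!(is.numeric), error = function(e) conditionMessage(e))
print(res)

# Negating the *result* of the predicate works
non_num_idx <- which(!sapply(df, is.numeric))

# Negate() builds a new predicate function, which also works
non_num_idx2 <- which(sapply(df, Negate(is.numeric)))

print(non_num_idx)
```

Either of the last two forms would probably have made the original which(sapply(...)) approach run, although where() remains the more idiomatic choice inside across.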

Output

# A tibble: 15 x 5
# Groups:   firm [3]
   firm  nweek   day   quant price
   <chr> <int> <dbl>   <dbl> <dbl>
 1 a         1   1.5 -0.336  0.903
 2 a         2   1.5  0.0837 0.579
 3 a         3   1.5  0.0541 0.425
 4 a         4   1.5  1.21   0.555
 5 a         5   1.5  0.462  0.806
 6 b         1   1.5  0.0493 0.346
 7 b         2   1.5  0.635  0.596
 8 b         3   1.5  0.406  0.583
 9 b         4   1.5 -0.707  0.205
10 b         5   1.5  0.157  0.816
11 c         1   1.5  0.728  0.271
12 c         2   1.5  0.117  0.775
13 c         3   1.5 -1.05   0.234
14 c         4   1.5 -1.35   0.290
15 c         5   1.5  0.771  0.310
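
One caveat: head(.x, 1) returns the first row of each group in whatever order the rows currently are, so if the data are not already sorted by day you may want to arrange() first. A small sketch, assuming dplyr >= 1.0.0; the data frame toy and its character column note are made up here to stand in for the Date and character columns mentioned in the question:

```r
library(dplyr)

# Made-up data: one firm, two weeks, two days per week, rows deliberately unsorted
toy <- data.frame(
  day   = c(2, 1, 2, 1),
  nweek = c(1L, 1L, 2L, 2L),
  firm  = "a",
  quant = c(10, 20, 30, 40),
  note  = c("w1-d2", "w1-d1", "w2-d2", "w2-d1")  # hypothetical non-numeric column
)

res <- toy %>%
  arrange(firm, nweek, day) %>%   # make "first" mean "first day of the week"
  group_by(firm, nweek) %>%
  summarise(
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
    across(!where(is.numeric), ~ first(.x)),
    .groups = "drop"
  )

print(res)
```

Here first() is used in place of head(.x, 1); for plain vectors the two should behave the same, so this is a matter of taste.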
