繁体   English   中英

R:跨组内的所有列对应用函数和/或回归(由时间定义)

[英]R: apply functions and/or regress over across all the pairs of columns within groups (defined by time)

数据集看起来像这样 第一列是时间(年-月)所有其他列都是公司 a、公司 b、公司 c 等的数值。下面是假数据

library(tidyverse)
data <- tibble(time = c(1,1,2,2), a = c(1,2,3,4), b =c(4,3,2,1), c = c(1,1,1,1))

该操作需要在每组时间内对每对列 a、b、c 应用函数和/或运行回归 应用任意函数(对两列求和)后,预期输出将如下所示

expected_output <- tibble(t = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
firm1 = c("a","a","a","b","b","b","c","c","c","a","a","a","b","b","b","c","c","c"),
firm2 = c("a","b","c","a","b","c","a","b","c","a","b","c","a","b","c","a","b","c"),
value = c(6,10,5,10,14,9,5,9,4,14,10,9,10,6,5,9,5,4))
expected_output

该函数可以是具有两个输入的任意函数(公司 a、公司 b、公司 c 等的列对)

fun1<-function(x, y){
  sum(x, y)
}

我目前使用 tidyverse 的方法如下,感谢 @akrun 和 @Ronak Shah

创建所有对公司 a, b, c

tmp <- expand.grid(firm1 = names(data[-1]), firm2 = names(data[-1]))

result1 <- data %>%
group_split(time) %>%
map_df(~cbind(time = .x$time[1], tmp, value = apply(tmp, 1, function(x) fun1(.x[[x[1]]],.x[[x[2]]]))))
result1

问题是,使用真实数据,上面的代码运行速度很慢。 我想原因是'map_df'。 但我对这种数据分析很陌生。 真实数据有2000多列,带来超过400万对列。 由于 data.table 以其效率和精度而闻名。 我想知道 data.table 中是否有办法完成这样的操作?

即使不使用 data.table,这也应该快得多

library(tidyverse)
data <- tibble(time = c(1,1,2,2), a = c(1,2,3,4), b =c(4,3,2,1), c = c(1,1,1,1))
data
#> # A tibble: 4 × 4
#>    time     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     4     1
#> 2     1     2     3     1
#> 3     2     3     2     1
#> 4     2     4     1     1
fun1<-function(x, y){
  sum(x, y)
}

data_agg = data %>% 
  group_by(time) %>% 
  summarise(across(.cols = a:c, .fns = sum))
data_agg
#> # A tibble: 2 × 4
#>    time     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     3     7     2
#> 2     2     7     3     2

data_agg_long <- data_agg %>% 
  pivot_longer(a:c)
data_agg_long
#> # A tibble: 6 × 3
#>    time name  value
#>   <dbl> <chr> <dbl>
#> 1     1 a         3
#> 2     1 b         7
#> 3     1 c         2
#> 4     2 a         7
#> 5     2 b         3
#> 6     2 c         2

data_agg_long %>% 
  full_join(data_agg_long,
            by = "time") %>% 
  # might not be necessary if the function is vectorised
  rowwise() %>% 
  mutate(value = fun1(value.x, value.y)) %>% 
  select(time, firm1 = name.x, firm2 = name.y, value)
#> # A tibble: 18 × 4
#> # Rowwise: 
#>     time firm1 firm2 value
#>    <dbl> <chr> <chr> <dbl>
#>  1     1 a     a         6
#>  2     1 a     b        10
#>  3     1 a     c         5
#>  4     1 b     a        10
#>  5     1 b     b        14
#>  6     1 b     c         9
#>  7     1 c     a         5
#>  8     1 c     b         9
#>  9     1 c     c         4
#> 10     2 a     a        14
#> 11     2 a     b        10
#> 12     2 a     c         9
#> 13     2 b     a        10
#> 14     2 b     b         6
#> 15     2 b     c         5
#> 16     2 c     a         9
#> 17     2 c     b         5
#> 18     2 c     c         4

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM