Is there a more efficient way to obtain the variance of lots of columns than dplyr?
I have a data.frame with more than 250,000 columns and 200 rows, so around 50 million individual values. I am trying to compute the variance of each column in order to select the columns with the most variance.
I am using dplyr as follows:
df %>% summarise_if(is.numeric, var)
It has been running on my iMac with 16 GB of RAM for about 8 hours now.
Is there a way to allocate more resources to the call, or a more efficient way to summarise the variance across columns?
I bet that selecting the numeric columns first, then calculating the variance, will be a lot faster:
df <- as.data.frame(matrix(runif(5e7), nrow = 200, ncol = 250000))
df_subset <- df[,sapply(df, is.numeric)]
sapply(df_subset, var)
The code above runs on my machine in about a second, and that is calculating the variance of every single column, because they are all numeric in my example.
Very wide data.frames are quite inefficient. I think converting to a matrix and using
matrixStats::colVars()
would be the fastest.
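A minimal sketch of that approach, assuming all columns are numeric as in the question (the column count is reduced here so the example runs quickly):

```r
# Sketch: convert once to a matrix, then take column variances in one
# vectorised call. matrixStats::colVars() operates on matrices, not data.frames.
library(matrixStats)

df <- as.data.frame(matrix(runif(2e6), nrow = 200, ncol = 10000))

m <- as.matrix(df)   # a single conversion; column access is then cheap
v <- colVars(m)      # variance of every column at once

# e.g. keep the 100 highest-variance columns
top <- order(v, decreasing = TRUE)[1:100]
df_top <- df[, top, drop = FALSE]
```

The names `m`, `v`, and `df_top` are only illustrative; the point is that one matrix conversion up front replaces 250,000 per-column dispatches.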
You may try using
data.table
which is usually faster.
library(data.table)
cols <- names(Filter(is.numeric, df))
setDT(df)
df[, lapply(.SD, var), .SDcols = cols]
Another approach you can try is getting the data in long format.
library(dplyr)
library(tidyr)
df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(var_value = var(value))
but I agree with @Daniel V that it is worth checking the data, as 8 hours is far too much time for this calculation.