I have a data.frame with >250,000 columns and 200 rows, so around 50 million individual values. I am trying to get a breakdown of the variance of the columns in order to select the columns with the most variance.
I am using dplyr as follows:
df %>% summarise_if(is.numeric, var)
It has been running on my iMac with 16 GB of RAM for about 8 hours now.
Is there a way to allocate more resources to the call, or a more efficient way to summarise the variance across columns?
I bet that selecting the columns first, then calculating the variance, will be a lot faster:
df <- as.data.frame(matrix(runif(5e7), nrow = 200, ncol = 250000))
df_subset <- df[,sapply(df, is.numeric)]
sapply(df_subset, var)
The code above runs on my machine in about a second, and that's calculating the variance on every single column because they're all numeric in my example.
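Since the stated goal is to keep the columns with the most variance, the per-column variances can then be sorted and used to subset the frame. A sketch building on the code above (the cutoff of 100 columns is only an illustration, not from the question):

```r
# Per-column variances as a named vector
vars <- sapply(df_subset, var)

# Keep the 100 highest-variance columns (arbitrary cutoff for illustration)
top_cols <- names(sort(vars, decreasing = TRUE))[1:100]
df_top <- df_subset[, top_cols]
```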
Very wide data.frames are quite inefficient. I think converting to a matrix and using matrixStats::colVars() would be the fastest.
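A hedged sketch of that approach, assuming the matrixStats package is installed and that all selected columns are numeric so the matrix conversion is lossless:

```r
library(matrixStats)

# Convert the numeric columns once to a matrix; colVars() is implemented in C
# and avoids the per-column overhead of a very wide data.frame
m <- as.matrix(df[, sapply(df, is.numeric)])
v <- colVars(m)
names(v) <- colnames(m)

# Names of the highest-variance columns
head(sort(v, decreasing = TRUE))
```

The one-time conversion cost is paid back because colVars() scans the matrix in a single compiled pass rather than dispatching var() 250,000 times.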
You may try using data.table, which is usually faster.
library(data.table)
cols <- names(Filter(is.numeric, df))
setDT(df)
df[, lapply(.SD, var), .SDcols = cols]
Another approach you can try is getting the data in long format.
library(dplyr)
library(tidyr)
df %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(var_value = var(value))
But I agree with @Daniel V that it is worth checking the data, as 8 hours is far too long for this calculation.
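To go from this long-format summary to the actual column selection the question asks for, slice_max() can pick the highest-variance names (the n = 100 cutoff is only illustrative):

```r
library(dplyr)
library(tidyr)

# Names of the 100 highest-variance numeric columns (illustrative cutoff)
top_cols <- df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(var_value = var(value)) %>%
  slice_max(var_value, n = 100) %>%
  pull(name)

df_top <- df %>% select(all_of(top_cols))
```

Note that pivot_longer() materialises all 50 million values as rows, so this is the most memory-hungry of the approaches here.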