I'm relatively new to R and am working on a project where I need to calculate a LOT of column means and standard deviations. I have a dataset called scores that has over 3 million observations of 172 variables. I need to transform each of these scores by subtracting a mean and dividing by a standard deviation. I am able to do what I want with my code below, but it takes up all of the memory in my R session (which is 50 GB). This step (calculating means and sds and transforming values) is the most memory-expensive step in my code, and I am wondering if there is anything I can do to lessen it. Would a function help? Should I store my data differently? Or does it take the same amount of power to do the math regardless of how you ask?
I am trying to avoid paying for a remote machine with more power if possible.
correct_scores <- TRUE
if (correct_scores) {
  # pull score data from larger database
  scores <- noise[["i_scores"]][["whole_dataset"]][, -c(1:4)]
  # calculate means and sds
  meanofmeans <- mean(apply(scores, 2, mean))
  meanofsds <- mean(apply(scores, 2, sd))
  # do the thing
  scores <- (scores - meanofmeans) / meanofsds
  # put values back into larger database
  noise[["i_scores-cor"]][["whole_dataset"]] <- cbind(noise[["i_scores"]][["whole_dataset"]][, c(1:4)], scores)
}
A tiny bit of reproducible code from the scores dataset:
scores <- data.frame(ENCFF802ZBQ = c(34.80, -0.01, 0.248, 0.54),
                     ENCFF477IRE = c(0.32, 0.24, -0.24, 23.01),
                     ENCFF127IJN = c(0.23, 0.56, 0.01, 0.01))
Thanks!!
Given your example:
library(data.table)
setDT(scores)[, lapply(.SD, scale)]
setDT(scores) converts scores to a data.table. lapply(.SD, scale) applies the scale(...) function to each column in scores (.SD is a shorthand in data.table for "subset of columns"). In this case the subset is all columns. See ?data.table for more information.
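One caveat: scale() standardizes each column by its own mean and sd, whereas the code in the question subtracts a single mean of column means and divides by a single mean of column sds. Below is a minimal sketch of that variant, assuming every column is numeric. It overwrites one column at a time with data.table's set(), so the main temporary should be a single column rather than a second copy of the whole table. The names mean_of_means and mean_of_sds are just illustrative.

```r
library(data.table)

scores <- data.frame(ENCFF802ZBQ = c(34.80, -0.01, 0.248, 0.54),
                     ENCFF477IRE = c(0.32, 0.24, -0.24, 23.01),
                     ENCFF127IJN = c(0.23, 0.56, 0.01, 0.01))
setDT(scores)  # convert to data.table by reference

# vapply() walks the columns one by one; apply() would first coerce
# the whole table to a matrix, which is a full extra copy
mean_of_means <- mean(vapply(scores, mean, numeric(1)))
mean_of_sds   <- mean(vapply(scores, sd, numeric(1)))

# overwrite each column by reference
for (j in seq_along(scores)) {
  set(scores, j = j, value = (scores[[j]] - mean_of_means) / mean_of_sds)
}
```

On the real data the same loop would run over the 172 score columns only, leaving the first four columns untouched, so there would be no need to split them off and cbind() them back afterwards.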
To your question: Should I store my data differently? Yes absolutely. But I'd need to see the structure of noise and perhaps how/why you import it that way to comment further.
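Without seeing noise this is only a guess, but if all 172 score columns are numeric, one option is to hold them as a plain numeric matrix instead of a data frame. At 8 bytes per double, 3 million rows by 172 columns is roughly 4.1 GB, so a single copy should fit comfortably in 50 GB. A rough sketch on the example data (scores_mat is a made-up name):

```r
scores <- data.frame(ENCFF802ZBQ = c(34.80, -0.01, 0.248, 0.54),
                     ENCFF477IRE = c(0.32, 0.24, -0.24, 23.01),
                     ENCFF127IJN = c(0.23, 0.56, 0.01, 0.01))

# one-off conversion; ideally the data would be read in as a matrix
# in the first place so this copy is never needed
scores_mat <- as.matrix(scores)

# colMeans() works directly on the matrix
mean_of_means <- mean(colMeans(scores_mat))

# sd() column by column, so only one column is extracted at a time
col_sds <- vapply(seq_len(ncol(scores_mat)),
                  function(j) sd(scores_mat[, j]),
                  numeric(1))
mean_of_sds <- mean(col_sds)

# transform column by column rather than building a second full matrix
for (j in seq_len(ncol(scores_mat))) {
  scores_mat[, j] <- (scores_mat[, j] - mean_of_means) / mean_of_sds
}
```

Whether R updates the matrix fully in place depends on how many references to it exist, so it is worth checking actual memory use on a subset of the data before running it on everything.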