How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R?
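I want to efficiently calculate normalized ratios in all possible combinations for a large matrix in R. I asked a similar question here earlier with a small amount of data, and the solutions provided there worked well. But when I apply the same solutions to a large dataset (400 x 2151), my system hangs. My system has 16 GB of RAM and an Intel i7 processor. Here is the code with data: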
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
The solution provided by @Ronak Shah:
cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])
Or the following solution provided by @akrun:
f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")
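Both solutions take a very long time, and the system hangs. So, how can I do this efficiently?
Update
Here is the expected output, shown on a small fake dataset: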
library(tidyverse)
#Fake dataset
df = structure(list(var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024),
var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025),
var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026),
var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028),
var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)),
class = "data.frame", row.names = c(NA, -7L))
cols <- 1:ncol(df)
mat_out <- do.call(cbind, lapply(cols, function(xj)
sapply(cols, function(xi) (df[, xj] - df[, xi])/(df[, xj] + df[, xi]))))
colnames(mat_out) <- outer(names(df), names(df), paste, sep = ",")
RWC <- read.table(text = "s_no RWC
1 95.512
2 97.9
3 92.897
4 94.209
5 87.472
6 91.109
7 92.83", header = T)
mat_out %>% as.data.frame() %>%
mutate(id = row_number()) %>%
left_join(RWC, by = c("id" = "s_no")) %>%
pivot_longer(cols = -c(RWC, id)) %>%
group_by(name) %>%
mutate(correl = cor(value, RWC, use = "complete.obs")) %>%
distinct(name, .keep_all = TRUE) %>%
separate(name, c("Wav1", "Wav2"), sep = ",") %>%
select(-c("id", "RWC", "value")) %>%
pivot_wider(names_from = Wav2, values_from = correl)
#> # A tibble: 5 × 6
#> Wav1 var_1 var_2 var_3 var_4 var_5
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 var_1 NA -0.190 -0.358 -0.537 -0.551
#> 2 var_2 0.190 NA -0.322 -0.528 -0.544
#> 3 var_3 0.358 0.322 NA -0.682 -0.667
#> 4 var_4 0.537 0.528 0.682 NA -0.595
#> 5 var_5 0.551 0.544 0.667 0.595 NA
Since memory appears to be your main issue, how about using an iterator? Using the package RcppAlgos*, we can make use of permuteIter to calculate your ratios N at a time.
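To make the memory problem concrete, here is a rough back-of-the-envelope estimate in base R. It assumes the result is stored as a dense matrix of doubles (8 bytes each) and ignores R's own overhead; the variable names are only for illustration.

```r
n_row  <- 400
n_col  <- 2151
n_pair <- n_col^2               # all (i, j) column pairs, including i == j
bytes  <- n_row * n_pair * 8    # one double takes 8 bytes
gib    <- bytes / 1024^3
round(gib, 1)
# about 13.8 (GiB) for the result alone
```

The expand.grid approach also builds df[, temp[, 2]] and df[, temp[, 1]], each of the same size as the result, so the intermediate copies alone would likely exceed 16 GB of RAM.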
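If you must have names, we need an additional iterator. This means you have to keep 2 iterators in sync, which can get tedious. Fortunately, with the summary() method of permuteIter, we can always see what the current index is, and reset the iterators using the various options (e.g. random access with [[, front(), back(), or startOver()).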
library(RcppAlgos)
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
ratioIter <- permuteIter(ncol(df), 2, FUN = function(x) {
(df[, x[2]] - df[, x[1]]) / (df[, x[2]] + df[, x[1]])
})
## if you really want to name your output, you must have
## an additional name iterator... not very elegant
nameIter <- permuteIter(paste0("col", 1:ncol(df)), 2, FUN = function(x) {
paste0(rev(x), collapse = "_")
})
firstIter <- matrix(ratioIter$nextIter(), ncol = 1)
firstName <- nameIter$nextIter()
colnames(firstIter) <- firstName
head(firstIter)
col2_col1
[1,] 0.2990054
[2,] -0.9808111
[3,] -0.9041054
[4,] 0.7970873
[5,] 0.8625776
[6,] 0.2768359
## returns a list, so we call do.call(cbind
next5Iter <- do.call(cbind, ratioIter$nextNIter(5))
next5Names <- unlist(nameIter$nextNIter(5))
colnames(next5Iter) <- next5Names
head(next5Iter)
col3_col1 col4_col1 col5_col1 col6_col1 col7_col1
[1,] -0.28099710 0.1665687 0.40565958 -0.7524038 -0.7132844
[2,] -0.81434900 -0.4283759 -0.89811556 -0.8462906 -0.5399741
[3,] -0.02289368 0.4285012 0.05087853 -0.5091659 -0.2328995
[4,] -0.06825458 0.3126928 0.68968843 -0.2180618 0.6651785
[5,] 0.33508319 0.7389108 0.84733425 0.9065263 0.8977107
[6,] 0.61773589 0.3443120 0.61084584 0.5727938 0.3888807
You should note that this does not show the results for i == j (those give NaN). So the total is slightly less than 2151^2 (in fact, it is exactly 2151^2 - 2151).
ratioIter$summary()
$description
[1] "Permutations of 2151 choose 2"
$currentIndex
[1] 6
$totalResults
[1] 4624650
$totalRemaining
[1] 4624644
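The totalResults value above can be cross-checked with plain arithmetic. This is just a sanity check in base R, not part of the RcppAlgos API; n and n_perm are illustrative names.

```r
n <- 2151
n_perm <- n * (n - 1)   # ordered pairs (i, j) with i != j
n_perm
# 4624650
n_perm == n^2 - n
# TRUE
```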
There is even random access, as well as previous iterators:
## Get the last ratio
lastIter <- ratioIter$back()
lastName <- nameIter$back()
mLast <- matrix(lastIter, ncol = 1)
colnames(mLast) <- lastName
head(mLast)
col2150_col2151
[1,] -0.6131926
[2,] 0.9936783
[3,] 0.1373538
[4,] 0.1014347
[5,] -0.5061608
[6,] 0.5773503
## iterate backwards with the previous methods
prev5Iter <- do.call(cbind, ratioIter$prevNIter(5))
prev5Names <- unlist(nameIter$prevNIter(5))
colnames(prev5Iter) <- prev5Names
head(prev5Iter)
col2149_col2151 col2148_col2151 col2147_col2151 col2146_col2151 col2145_col2151
[1,] -0.75500069 -0.72757136 -0.94457988 -0.82858884 -0.25398782
[2,] 0.99696694 0.99674084 0.99778638 0.99826472 0.95738947
[3,] 0.27701596 0.45696010 0.00682574 0.01529448 -0.62368764
[4,] -0.09508689 -0.90698165 -0.38221934 -0.41405984 0.01371556
[5,] -0.31580709 -0.06561386 -0.07435058 -0.08033145 -0.90692881
[6,] 0.82697720 0.86858595 0.81707206 0.75627297 0.46272349
## Get a random sample
set.seed(123)
randomIter <- do.call(cbind, ratioIter[[sample(4624650, 5)]])
## We must reset the seed in order to get the same output for the names
set.seed(123)
randomNames <- unlist(nameIter[[sample(4624650, 5)]])
colnames(randomIter) <- randomNames
head(randomIter)
col1044_col939 col20_col1552 col412_col2014 col1751_col1521 col337_col1295
[1,] -0.3902066 0.4482747 -0.108018200 -0.1662857 -0.3822436
[2,] -0.2358101 0.9266657 -0.657135882 0.0671608 -0.6821823
[3,] -0.7054217 0.8944720 0.092363665 0.2667708 0.1908249
[4,] -0.1574657 0.2775225 -0.221737223 0.3381454 -0.5705021
[5,] -0.4282909 -0.4406433 0.092783086 -0.7506674 -0.1276932
[6,] 0.9998189 -0.2497586 -0.009375891 0.7071864 -0.2425258
Finally, it is written in C++, so it is very fast:
system.time(ratioIter$nextNIter(1e3))
# user system elapsed
# 0 0 0
* I am the author of RcppAlgos
You can use Rcpp to make your code faster:
Rcpp::cppFunction("
std::vector<double> my_fun(arma::mat& x, arma::vec& y){
int p = x.n_cols - 1;
std::vector<double> result;
for(int i = 0; i<p; i++){
auto m = (x.cols(i+1, p).each_col() - x.col(i));
m /= (x.cols(i+1, p).each_col() + x.col(i));
auto a = arma::conv_to<std::vector<double>>::from(arma::cor(m, y));
result.insert(result.end(), a.begin(), a.end());
}
return result;
}", 'RcppArmadillo')
my_fun(df, y) # takes approximately 14 seconds
You could use STL functions to make it even faster, although the code would be longer. On my computer that version takes 6 seconds.
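For readers who prefer to stay in R, the loop in the C++ function above can be mirrored in base R. The following is only a sketch under stated assumptions: x is a small random stand-in for the 400 x 2151 matrix, y is a hypothetical response vector (such as RWC in the question), and cor_mat is an illustrative name. Like the C++ version, it only ever holds the ratios for one column against the remaining columns, never the full set of pairs, but it will be slower than compiled code.

```r
# Small stand-in data; the matrix in the question is 400 x 2151.
set.seed(42)
x <- matrix(rexp(50 * 8), nrow = 50, ncol = 8)
y <- rnorm(50)   # hypothetical response vector (RWC in the question)

p <- ncol(x)
cor_mat <- matrix(NA_real_, nrow = p, ncol = p)

for (i in seq_len(p - 1)) {
  rest <- (i + 1):p
  others <- x[, rest, drop = FALSE]
  # x[, i] is recycled down each column of `others`, so `ratio`
  # holds only nrow(x) * length(rest) values at a time.
  ratio <- (others - x[, i]) / (others + x[, i])
  r <- as.vector(cor(ratio, y))
  cor_mat[i, rest] <- r
  # Swapping the two columns flips the sign of the ratio,
  # and therefore the sign of the correlation.
  cor_mat[rest, i] <- -r
}
```

Here cor_mat[i, j] holds the correlation of (x[, j] - x[, i]) / (x[, j] + x[, i]) with y, which is the same layout as the expected output in the question; the diagonal stays NA.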