How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R?

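I want to efficiently calculate normalized ratios in all possible combinations for a large matrix in R. I asked a similar question here earlier, and with a small dataset the solutions provided there worked well. However, when I try to apply the same solutions to a large dataset (400 x 2151), my system hangs. My system has 16 GB of RAM and an Intel i7 processor. Here is the code with data: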

df <- matrix(rexp(860400), nrow = 400, ncol = 2151)

The solution provided by @Ronak Shah:

cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])

Or the following solution provided by @akrun:

f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")

Both solutions take a very long time and the system hangs. So how can I do this efficiently?

Update

Updated with the expected output:

library(tidyverse)

#Fake dataset
df = structure(list(var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024), 
                    var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025), 
                    var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026), 
                    var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028), 
                    var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)), 
               class = "data.frame", row.names = c(NA, -7L))

cols <- 1:ncol(df)
mat_out <- do.call(cbind, lapply(cols, function(xj) 
  sapply(cols, function(xi) (df[, xj] - df[, xi])/(df[, xj] + df[, xi]))))

colnames(mat_out) <-  outer(names(df), names(df), paste, sep = ",")

RWC <- read.table(text = "s_no  RWC
1   95.512
2   97.9
3   92.897
4   94.209
5   87.472
6   91.109
7   92.83", header = T)

mat_out %>% as.data.frame() %>% 
  mutate(id = row_number()) %>% 
  left_join(RWC, by = c("id" = "s_no")) %>% 
  pivot_longer(cols = -c(RWC, id)) %>% 
  group_by(name) %>% 
  mutate(correl = cor(value, RWC, use = "complete.obs")) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  separate(name, c("Wav1", "Wav2"), sep = ",") %>% 
  select(-c("id", "RWC", "value")) %>% 
  pivot_wider(names_from = Wav2, values_from = correl)

#> # A tibble: 5 × 6
#>   Wav1   var_1  var_2  var_3  var_4  var_5
#>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 var_1 NA     -0.190 -0.358 -0.537 -0.551
#> 2 var_2  0.190 NA     -0.322 -0.528 -0.544
#> 3 var_3  0.358  0.322 NA     -0.682 -0.667
#> 4 var_4  0.537  0.528  0.682 NA     -0.595
#> 5 var_5  0.551  0.544  0.667  0.595 NA

Since memory appears to be your main issue, how about using iterators? Using the package RcppAlgos*, we can use permuteIter to calculate your ratios N at a time.
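As a rough illustration of why memory is the bottleneck, the following back-of-the-envelope sketch estimates the size of the full result. It assumes the result is stored as ordinary double-precision values (8 bytes each) and ignores the temporary copies R makes along the way, so the real peak usage would be higher. The variable names are illustrative only.

```r
n_rows  <- 400          # observations
n_cols  <- 2151         # variables
n_pairs <- n_cols^2     # ordered pairs (i, j), including i == j

bytes <- n_rows * n_pairs * 8   # 8 bytes per double
gib   <- bytes / 1024^3

n_pairs          # 4626801
round(gib, 1)    # 13.8
```

The result alone is close to the 16 GB of RAM mentioned in the question, before any intermediate matrices are counted.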

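If you must have names, we need an additional iterator. This means you have to keep 2 iterators in sync, which can become tedious. Fortunately, with the summary() method of permuteIter we can always see what the current index is, and we can reset the iterators using the various options (e.g. random access with [[, front(), back(), or startOver()).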

library(RcppAlgos)
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)

ratioIter <- permuteIter(ncol(df), 2, FUN = function(x) {
    (df[, x[2]] - df[, x[1]]) / (df[, x[2]] + df[, x[1]])
})

## if you really want to name your output, you must have
## an additional name iterator... not very elegant
nameIter <- permuteIter(paste0("col", 1:ncol(df)), 2, FUN = function(x) {
    paste0(rev(x), collapse = "_")
})

firstIter <- matrix(ratioIter$nextIter(), ncol = 1)
firstName <- nameIter$nextIter()
colnames(firstIter) <- firstName

head(firstIter)
      col2_col1
[1,]  0.2990054
[2,] -0.9808111
[3,] -0.9041054
[4,]  0.7970873
[5,]  0.8625776
[6,]  0.2768359

## nextNIter returns a list, so we combine it with do.call(cbind, ...)
next5Iter <- do.call(cbind, ratioIter$nextNIter(5))
next5Names <- unlist(nameIter$nextNIter(5))
colnames(next5Iter) <- next5Names

head(next5Iter)
       col3_col1  col4_col1   col5_col1  col6_col1  col7_col1
[1,] -0.28099710  0.1665687  0.40565958 -0.7524038 -0.7132844
[2,] -0.81434900 -0.4283759 -0.89811556 -0.8462906 -0.5399741
[3,] -0.02289368  0.4285012  0.05087853 -0.5091659 -0.2328995
[4,] -0.06825458  0.3126928  0.68968843 -0.2180618  0.6651785
[5,]  0.33508319  0.7389108  0.84733425  0.9065263  0.8977107
[6,]  0.61773589  0.3443120  0.61084584  0.5727938  0.3888807

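You should note that this does not produce results for i == j (these give NaN). So the total number of results is slightly less than 2151^2 (in fact it is exactly 2151^2 - 2151).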

ratioIter$summary()
$description
[1] "Permutations of 2151 choose 2"

$currentIndex
[1] 6

$totalResults
[1] 4624650

$totalRemaining
[1] 4624644
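The counts reported by summary() can be checked with plain arithmetic. This small sketch uses only base R, and the variable names are illustrative.

```r
n     <- 2151
total <- n * (n - 1)            # ordered pairs with i != j
total == n^2 - n                # TRUE
total                           # 4624650

# 1 result from nextIter() plus 5 from nextNIter(5) have been consumed
consumed  <- 1 + 5
remaining <- total - consumed   # 4624644
```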

There is even random access and backward iteration:

## Get the last ratio
lastIter <- ratioIter$back()
lastName <- nameIter$back()
mLast <- matrix(lastIter, ncol = 1)
colnames(mLast) <- lastName

head(mLast)
     col2150_col2151
[1,]      -0.6131926
[2,]       0.9936783
[3,]       0.1373538
[4,]       0.1014347
[5,]      -0.5061608
[6,]       0.5773503

## iterate backwards with the previous methods
prev5Iter <- do.call(cbind, ratioIter$prevNIter(5))
prev5Names <- unlist(nameIter$prevNIter(5))
colnames(prev5Iter) <- prev5Names

head(prev5Iter)
     col2149_col2151 col2148_col2151 col2147_col2151 col2146_col2151 col2145_col2151
[1,]     -0.75500069     -0.72757136     -0.94457988     -0.82858884     -0.25398782
[2,]      0.99696694      0.99674084      0.99778638      0.99826472      0.95738947
[3,]      0.27701596      0.45696010      0.00682574      0.01529448     -0.62368764
[4,]     -0.09508689     -0.90698165     -0.38221934     -0.41405984      0.01371556
[5,]     -0.31580709     -0.06561386     -0.07435058     -0.08033145     -0.90692881
[6,]      0.82697720      0.86858595      0.81707206      0.75627297      0.46272349

## Get a random sample
set.seed(123)
randomIter <- do.call(cbind, ratioIter[[sample(4624650, 5)]])

## We must reset the seed in order to get the same output for the names
set.seed(123)
randomNames <- unlist(nameIter[[sample(4624650, 5)]])
colnames(randomIter) <- randomNames

head(randomIter)
     col1044_col939 col20_col1552 col412_col2014 col1751_col1521 col337_col1295
[1,]     -0.3902066     0.4482747   -0.108018200      -0.1662857     -0.3822436
[2,]     -0.2358101     0.9266657   -0.657135882       0.0671608     -0.6821823
[3,]     -0.7054217     0.8944720    0.092363665       0.2667708      0.1908249
[4,]     -0.1574657     0.2775225   -0.221737223       0.3381454     -0.5705021
[5,]     -0.4282909    -0.4406433    0.092783086      -0.7506674     -0.1276932
[6,]      0.9998189    -0.2497586   -0.009375891       0.7071864     -0.2425258

Finally, it is written in C++, so it is very fast:

system.time(ratioIter$nextNIter(1e3))
#  user  system elapsed 
#     0       0       0

* I am the author of RcppAlgos

You can use Rcpp to make your code faster:

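Rcpp::cppFunction("
  // For each column i, build the normalized ratios of all later columns
  // against column i, then correlate every ratio column with y.
  std::vector<double> my_fun(arma::mat& x, arma::vec& y){
    int p = x.n_cols - 1;
    std::vector<double> result;
    for(int i = 0; i<p; i++){
      arma::mat m = (x.cols(i+1, p).each_col() - x.col(i));
      m /= (x.cols(i+1, p).each_col() + x.col(i));
      auto a = arma::conv_to<std::vector<double>>::from(arma::cor(m, y));
      result.insert(result.end(), a.begin(), a.end());
    }
    return result;
}", depends = 'RcppArmadillo')

# df must be a numeric matrix and y a numeric vector with one value per row of df
my_fun(df, y) # takes approximately 14 seconds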

You can use STL functions to make it even faster, although the code will be longer. On my computer that version takes 6 seconds.
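For readers without a C++ toolchain, the same column-at-a-time idea can be sketched in base R. This is only an illustration under stated assumptions, not the answerer's code: ratio_cor is a hypothetical helper name, y is assumed to be a numeric vector with one value per row, and the toy data are copied from the question. Like the Rcpp loop, it only forms the ratios of later columns against column i, so it never holds more than one n x p block in memory; it then fills the mirrored entry by flipping the sign.

```r
# Hypothetical helper (not part of any package): correlation of every
# pairwise normalized ratio with y, computed one column at a time.
ratio_cor <- function(x, y) {
  x <- as.matrix(x)
  p <- ncol(x)
  out <- matrix(NA_real_, p, p, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(p - 1)) {
    later <- (i + 1):p
    xl <- x[, later, drop = FALSE]
    # ratios (x_later - x_i) / (x_later + x_i), one column per later variable
    r  <- (xl - x[, i]) / (xl + x[, i])
    ci <- drop(cor(r, y))
    out[i, later] <- ci
    out[later, i] <- -ci   # swapping the pair only flips the sign
  }
  out                      # diagonal stays NA (i == j is skipped)
}

# Toy data from the question
df_small <- data.frame(
  var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024),
  var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025),
  var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026),
  var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028),
  var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)
)
rwc <- c(95.512, 97.9, 92.897, 94.209, 87.472, 91.109, 92.83)

res <- ratio_cor(df_small, rwc)
round(res, 3)
```

Rows play the role of Wav1 and columns the role of Wav2 in the expected output table, so res["var_1", "var_2"] should agree with the -0.190 shown there. For the 400 x 2151 matrix the returned correlation matrix has 2151^2 entries, which is a few tens of megabytes rather than gigabytes.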
