PCoA error in R caused by a large data set

Question

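For a project at work I have to perform a PCoA (principal coordinates analysis, also known as classical multidimensional scaling). However, I run into problems when doing this analysis in R.

The function cmdscale only accepts a matrix or a dist object as input, and the dist function gives an error: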

Error: cannot allocate vector of size 4.2 Gb
In addition: Warning messages:
1: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
2: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
3: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
4: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)

When I use matrix instead, it changes the input into:

     [,1]         
[1,] Integer,33741
[2,] Integer,33741

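I can't post the contents of the data set online, but I can give you its dimensions: it is 33741 rows long and 11 columns wide. The first column is an ID; the other 10 hold the values needed for the PCoA.

As you can see in the error, I am using only 2 of the columns and I already get the memory error.

My questions are:
Is it possible to manipulate the data in such a way that the dist function stays within my memory limit?
What is the matrix function doing that turns my input into the two-row output shown above?

What I have tried: clearing memory with garbage collection, restarting the GUI, restarting the system.

System: Windows 7 x64, i7 920qm 1.8GHz, 4GB DDR3 RAM

Code used: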

mydata <- read.table(file, header=TRUE)

mydist <- dist(mydata[c(3,4)], method="euclidian", diag=FALSE, upper=FALSE)
mymatrix <- matrix(mydata[c(3,4)], byrow=FALSE)
mymatrix <- matrix(cbind(mydata[c(3,4)]))

mycmdscale <- cmdscale(mydist, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)
mycmdscale <- cmdscale(mymatrix, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)

plot(mycmdscale)

Of course I did not run the code in this order, but it contains the different ways I tried to load the data.

Thanks in advance for any replies.

Answer 1

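You have too little RAM to do this in R, because R holds all objects in memory. I may not have the calculation exactly right (I forget the overhead of R objects), but just to hold the full dissimilarity matrix you would need about 9 GB of RAM.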

> print(object.size(matrix(0, ncol = 34000, nrow = 34000)), units = "Gb")
8.6 Gb

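dist gets away with less in its internal representation, because it only stores 0.5 * (nr * (nr - 1)) doubles, the lower triangle (nr is the number of rows in the input data):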

> print(object.size(numeric(length = 0.5 * 34000 * 33999)), units = "Gb")
4.3 Gb

[This is probably where the error you saw comes from.]
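As a quick check, the same arithmetic can be done without allocating anything. This is only a sketch: it assumes 8 bytes per double, ignores R's small per-object overhead, and uses the asker's reported row count of 33741. The variable names and the `gib` helper are made up for illustration.

```r
n <- 33741                # rows reported in the question
bytes_per_double <- 8     # assumption: numeric values are 8-byte doubles

# dist() stores only the lower triangle: n * (n - 1) / 2 values
dist_bytes <- n * (n - 1) / 2 * bytes_per_double

# a full square matrix stores n * n values
full_bytes <- n^2 * bytes_per_double

# R reports "Gb" in units of 1024^3 bytes
gib <- function(bytes) bytes / 1024^3

round(gib(dist_bytes), 1)  # 4.2, matching "cannot allocate vector of size 4.2 Gb"
round(gib(full_bytes), 1)  # 8.5
```

So the dist object alone is already larger than the 4020 Mb allocation limit shown in the warnings, no matter how many columns go in.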

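In practice, you would need upwards of 20-30 GB of RAM to do anything useful with the dissimilarity matrix once it has been computed. Even if you could compute them, the eigenvectors of the PCoA solution would need about 9 GB of RAM on their own.

So a more pertinent question is: what do you hope to do with a PCoA of c. 34,000 samples/observations?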

To get a matrix from mydata[3:4], you can use

as.matrix(mydata[3:4])

or, if you have factors and want to keep their numeric interpretation,

data.matrix(mydata[3:4])
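To see what matrix() was doing in the question, here is a small sketch on a toy data frame. The `mydata` below is a made-up stand-in, not the asker's data. A data frame is a list of columns, so matrix() treats each column as a single element and builds a list-matrix with one cell per column, which is the `Integer,33741` output shown in the question.

```r
# Hypothetical stand-in for the asker's data: an id plus a few integer columns
mydata <- data.frame(id = 1:5,
                     v1 = 5:1,
                     v2 = c(2L, 4L, 6L, 8L, 10L),
                     v3 = c(1L, 3L, 5L, 7L, 9L))

# matrix() on a data frame: each column becomes one list element
m_list <- matrix(mydata[c(3, 4)])
dim(m_list)      # 2 1
typeof(m_list)   # "list"

# as.matrix() keeps the row/column layout and gives a numeric matrix
m_num <- as.matrix(mydata[c(3, 4)])
dim(m_num)       # 5 2
is.numeric(m_num)  # TRUE
```

Only the second form is something dist or cmdscale can work with.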

Answer 2

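I know this is old, but I thought I would add my two cents...

I am a bit surprised that @Gavin Simpson did not mention that principal coordinates analysis on a matrix of Euclidean distances is the same as principal component analysis (at least when both use scaling = 1).

See p. 143 in Borcard, D., Gillet, F., & Legendre, P. (2011). Chapter 5, Unconstrained Ordination (pp. 115–151). In Numerical Ecology with R. New York, NY: Springer New York. doi:10.1007/978-1-4419-7976-6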
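The equivalence can be checked on a small data set with base R alone. This is a sketch on simulated data, not the OP's: `X`, `pcoa`, `pca`, and `big` are illustrative names. It compares cmdscale on Euclidean distances against prcomp on the unscaled variables. The axes can come out with flipped signs, so the comparison is on absolute values.

```r
set.seed(1)
X <- matrix(rnorm(200 * 10), ncol = 10)   # small enough for dist() to be cheap

# PCoA: classical MDS on the Euclidean distance matrix
pcoa <- cmdscale(dist(X), k = 2)

# PCA on the same (centered, unscaled) variables; no distance matrix needed
pca <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]

# Largest difference between the two sets of scores, ignoring axis sign
max_diff <- max(abs(abs(pcoa) - abs(pca)))
max_diff < 1e-8   # TRUE

# The PCA route scales to the size in the question, because it never
# builds the n x n distance matrix
big <- matrix(rnorm(34000 * 10), ncol = 10)
big_scores <- prcomp(big)$x[, 1:2]
dim(big_scores)   # 34000 2
```

If the variables are standardized first (as with scale = TRUE in the vegan call further down), the matching PCoA would be on Euclidean distances of the standardized data.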

The PCA below ran fine on my current local machine: Windows 7 x64, i5-2500 3.3GHz, 8GB RAM

library(vegan) # to perform PCA and associated operations 
library(ggplot2) # plotting (not necessary, but nice)
library(grid) # arrow()

#make a big test set like OP's
test<-data.frame(id=seq(34000), var1=rnorm(34000), var2=rnorm(34000),
                 var3=rnorm(34000),var4=rnorm(34000),var5=rnorm(34000),
                 var6=rnorm(34000),var7=rnorm(34000),var8=rnorm(34000),
                 var9=rnorm(34000),var10=rnorm(34000))
#calculate PCA
test.pca<-rda(test[-1], scale=TRUE) # drop the id column: it is a label, not a variable

#calculate percent variation on each axis
test.pca.percExp<-round(eigenvals(test.pca)/sum(eigenvals(test.pca))*100, 2)

#extract scores for plotting
test.pca.sc<-scores(test.pca, choices=c(1,2), 
                           display=c("sites", "species"), scaling=1)

test.pca.site<-data.frame(test.pca.sc$sites)
test.pca.spe<-data.frame(test.pca.sc$species)
test.pca.spe$VAR<-rownames(test.pca.spe)

#make the plot
test.pca.p<-ggplot(test.pca.site, aes(PC1, PC2)) + 
  xlab(sprintf("PC1 %s%s", test.pca.percExp[1], "%")) + 
  ylab(sprintf("PC2 %s%s", test.pca.percExp[2], "%")) 

#add points and biplot arrows to plot
test.pca.p + 
  geom_point() +
  geom_segment(data = test.pca.spe,
               aes(x = 0, xend = PC1, y = 0, yend = PC2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text(data=test.pca.spe,
            aes(x=PC1,y=PC2,label=VAR),
            size=3, position=position_jitter(width=-2, height=0.1))+
  guides(color = guide_legend(title = "Var"))


#hard to see the points with arrows, so plot without the arrows
test.pca.p + 
  geom_point()


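I stumbled on this question because I have the same problem with a Manhattan distance matrix, for which this answer is no help (as far as I know; perhaps there is a way to transform the data before the PCA so that it gives the same result). For Euclidean distances, though, this answer essentially gives the result I believe the OP was looking for. Hope it helps others too...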


Answer 1 (0, accepted): 2013-05-14 15:10:55
Answer 2 (0): 2014-11-26 02:05:10
