Creating a distance matrix in R using parallelization

Question

I have N vectors containing the cumulative frequencies of tweets. For clarification, one of these vectors would look like (0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, ...).

I wanted to visualize the differences in these frequencies by creating a heat map. For that, I first wanted to create an NxN matrix containing the Euclidean distances between tweets. My first approach is rather Java-like and looks like this:

create_dist <- function(x){
  n <- length(x)                             #number of tweets
  xy <- matrix(nrow=n, ncol=n)               #create NxN matrix
  colnames(xy) <- names(x)                   #set column
  rownames(xy) <- names(x)                   #and row names

  for(i in 1:n) {
    for(j in 1:n){
      xy[i,j] <- distance(x[[i]], x[[j]])    #calculate euclidean distance for now, but should be interchangeable
    }
  }

  xy
}

I measured the time it takes to create this distance matrix, and for a small sample (around two thousand tweets) it already takes about 35 seconds.

> system.time(create_dist(cumFreqs))
user  system elapsed 
34.572   0.000  34.602

Then I thought about how I could speed up the calculation a little, and because my computer has 8 cores I thought parallelization might make it faster.

Being the R novice I am, I changed the inner for loop to a foreach loop.

#libraries
library(foreach)
library(doMC)
registerDoMC(4)

create_dist <- function(x){
  n <- length(x)                                #number of tweets
  xy <- matrix(nrow=n, ncol=n)                  #create NxN matrix
  colnames(xy) <- names(x)                      #set column
  rownames(xy) <- names(x)                      #and row names

  for(i in 1:n) {
    xy[i,] <- unlist(foreach(j=1:n) %dopar% {   #set each row of the matrix
      distance(x[[i]], x[[j]])
    })
  }

  xy
}

Again I wanted to measure the time it takes to create a distance matrix for a sample of two thousand tweets using system.time(), but I cancelled the execution after 10 minutes because there was obviously no speed-up at all.

I googled for solutions, but unfortunately I haven't found any. Is there a better way to create this distance matrix, maybe with one of the apply functions, which, I'm not ashamed to admit, still confuse me?

Answer 1

As mentioned, you can use the dist function. Here is an example of how to use the result of dist to create a heat map.

nn <- paste0('row',1:5)
x <- matrix(rnorm(25), nrow = 5,dimnames=list(nn))
distObj <- dist(x)
cols <- c("#D33F6A", "#D95260", "#DE6355", "#E27449", 
            "#E6833D", "#E89331", "#E9A229", "#EAB12A", "#E9C037", 
            "#E7CE4C", "#E4DC68", "#E2E6BD")
## mandatory coercion
distObj <- as.matrix(distObj)
## heatmap
image(distObj[order(nn), order(nn)], col = cols, 
      xaxt = "n", yaxt = "n")
## axes labels
axis(1, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)
axis(2, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)
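
To connect this to the question's data: a minimal sketch, assuming the cumulative frequency vectors sit in a named list and all have the same length. The list `cumFreqs` below is a made-up stand-in for the asker's data, not the real tweets. `dist` works on the rows of a matrix, so the vectors are stacked with `rbind` first.

```r
# Hypothetical stand-in for the asker's data: a named list of
# equal-length cumulative frequency vectors.
cumFreqs <- list(
  tweetA = c(0, 0, 1, 1, 2, 3),
  tweetB = c(0, 1, 1, 2, 2, 2),
  tweetC = c(1, 1, 1, 1, 1, 1)
)

# Stack the vectors as rows; dist() computes distances between rows.
m <- do.call(rbind, cumFreqs)

# Euclidean is the default; method= also accepts e.g. "manhattan" or "maximum".
distMat <- as.matrix(dist(m, method = "euclidean"))
```

The resulting `distMat` carries the list names as row and column names, so it can go straight into the `image` call above in place of `distObj`.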


Answer 2

As 'agstudy' suggests, use the built-in 'dist' function.

For future reference, nested for loops in R are pretty slow. As R is a functional language, try to use vectorised operations and functions such as the apply family (apply, lapply, sapply, tapply). It takes some time to learn to think about programming tasks in a functional way when you're used to a C-like paradigm.

A useful discussion of benchmarks comparing for loops and the apply flavours is here: Is R's apply family more than syntactic sugar?
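
For a distance measure that `dist` does not offer, the nested loop from the question could be written with the apply family roughly as follows. This is only a sketch: `euclid` and `pairwise_dist` are hypothetical names, the input is assumed to be a named list of equal-length numeric vectors, and this form is mainly more compact, not necessarily faster than the loop.

```r
# Hypothetical distance function; any function of two vectors would do.
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Apply-family version of the double loop: one column per element of x.
pairwise_dist <- function(x, distance = euclid) {
  sapply(x, function(a) sapply(x, function(b) distance(a, b)))
}

# Small made-up input to show the call.
x <- list(a = c(0, 1, 2), b = c(1, 1, 1), c = c(3, 2, 1))
d <- pairwise_dist(x)
```

Because `sapply` keeps the names of a named list, `d` comes back with row and column names already set.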
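
If parallelism is still wanted for a custom distance, one likely reason the foreach version in the question crawled is that it dispatches one tiny task per matrix cell, so scheduling overhead dominates. A rough sketch that hands each worker a whole row instead, using the base `parallel` package; `euclid`, `row_dist` and `create_dist_par` are hypothetical names, and the worker count of 2 is arbitrary:

```r
library(parallel)

# Hypothetical distance function; swap in any function of two vectors.
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Work for one task: distances from element i to every element of x.
row_dist <- function(i, x, distance) {
  vapply(x, function(b) distance(x[[i]], b), numeric(1))
}

create_dist_par <- function(x, distance = euclid, cores = 2) {
  cl <- makeCluster(cores)        # start worker processes
  on.exit(stopCluster(cl))        # always shut them down again
  rows <- parLapply(cl, seq_along(x), row_dist, x = x, distance = distance)
  xy <- do.call(rbind, rows)      # one row per element of x
  rownames(xy) <- names(x)
  xy
}

# Small made-up input to show the call.
x <- list(a = c(0, 1, 2), b = c(1, 1, 1), c = c(3, 2, 1))
res <- create_dist_par(x)
```

For plain Euclidean distance the built-in `dist` is still likely to be the better choice; a sketch like this only pays off when each distance evaluation is expensive enough to outweigh the cost of starting workers and copying `x` to them.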

Answer 1
2 accepted 2013-06-16 13:03:59

Answer 2
0 2013-07-04 13:35:37