Creating an array per executor in Spark and combining them into an RDD

I am moving from MPI-based systems to Apache Spark. I need to do the following in Spark.

Suppose I have n vertices. I want to create an edge list from these n vertices. An edge is just a tuple of two integers (u,v); no attributes are required.

However, I want to create them in parallel, independently in each executor. That is, I want to create P edge arrays independently on P Spark executors. Each array may have a different size and depends on the vertices assigned to it, so I also need the executor id, ranging from 0 to P-1. Next, I want to have one global RDD of edges.

In MPI, I would create an array in each processor using the processor rank. How do I do that in Spark, especially using the GraphX library?
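(For orientation: the closest Spark analogue of the MPI rank is the partition index exposed by mapPartitionsWithIndex, which is exactly what the answer below builds on. A minimal sketch, with purely illustrative names and sizes:

val P = 8 // hypothetical partition count, standing in for the MPI world size
val perRank = sc.parallelize(Seq.empty[Int], P)
  .mapPartitionsWithIndex { (rank, _) =>
    // Build this partition's local data from its "rank".
    Iterator.single(rank -> Array.tabulate(4)(j => rank * 4 + j))
  }
)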

Therefore, my primary goal is to create an array of edges in each executor and combine them into a single RDD.

I am first trying a modified version of the Erdős-Rényi model. As parameters, I only have the number of nodes n and a probability p.

Suppose executor i has to process nodes 101 to 200. For any node, say node 101, it will create edges from 101 to each of 102 through n, each with probability p. After each executor creates its allocated edges, I would instantiate the GraphX EdgeRDD and VertexRDD. Therefore, my plan is to create the edge lists independently in each executor and merge them into an RDD.

Let's start with some imports and variables that will be required for downstream processing:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.Random
import org.apache.spark.HashPartitioner

val nPartitions: Int = ??? // number of partitions (parallel generators)
val n: Long = ???          // number of vertices
val p: Double = ???        // edge probability
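For concreteness, a hypothetical small configuration could look like this (the values are purely illustrative; the ??? above are meant to be filled in by you):

val nPartitions: Int = 4
val n: Long = 10000L
val p: Double = 0.001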

Next, we'll need an RDD of seed IDs that can be used to generate edges. A naive way to handle this would be something like this:

sc.parallelize(0L until n)

Since the number of generated edges depends on the node id, this approach would give a highly skewed load: node 0 has n - 1 candidate neighbors while node n - 1 has none, and a plain range keeps all the expensive low ids next to each other in the first partitions. We can do a little bit better with repartitioning:

sc.parallelize(0L until n)
  .map((_, None))                                // dummy values, just for partitionBy
  .partitionBy(new HashPartitioner(nPartitions)) // spread ids across partitions
  .keys

but a much better approach is to start with an empty RDD and generate the ids in place, which avoids shuffling n records just to lay them out. We'll need a small helper:

def genNodeIds(nPartitions: Int, n: Long)(i: Int) = {
  // Partition i gets every id congruent to i modulo nPartitions,
  // which interleaves cheap and expensive ids across partitions.
  (0L until n).filter(_ % nPartitions == i).toIterator
}

which can be used as follows:

val empty = sc.parallelize(Seq.empty[Int], nPartitions)
val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
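To see the resulting layout (illustration only, since it collects everything to the driver), a quick peek with glom shows the round-robin assignment; for example, with nPartitions = 4 and n = 10, partition 0 holds 0, 4, 8 and partition 1 holds 1, 5, 9:

// Illustration only: materialize each partition as an array on the driver.
ids.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}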

Just a quick sanity check (it is quite expensive so don't use it in production):

require(ids.distinct.count == n) 
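Since genNodeIds assigns each id to exactly one partition by construction (id % nPartitions == i), a cheaper check that avoids the shuffle required by distinct is to compare only the count:

// Cheaper variant: the ids are disjoint across partitions by construction,
// so a plain count is enough. This still triggers a job, but no shuffle.
require(ids.count == n)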

and we can generate the actual edges using another helper:

def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
  // Only ids above i are candidates, so each undirected pair (i, j)
  // is considered exactly once; keep each edge with probability p.
  (i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
}

def genEdgesForPartition(iter: Iterator[Long]) = {
  // It could be overkill, but better safe than sorry.
  // Depending on your requirements, it could be worth
  // considering commons-math:
  // https://commons.apache.org/proper/commons-math/userguide/random.html
  val random = new Random(new java.security.SecureRandom())
  iter.flatMap(genEdgesForId(p, n, random))
}
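If you do follow the commons-math suggestion from the comment, a hypothetical variant could look like this (it assumes the commons-math3 artifact is on the executor classpath; MersenneTwister is one of its RandomGenerator implementations):

import org.apache.commons.math3.random.MersenneTwister

def genEdgesForPartitionCM(iter: Iterator[Long]) = {
  // One generator per partition, created on the executor.
  val rng = new MersenneTwister()
  iter.flatMap { i =>
    (i + 1 until n).filter(_ => rng.nextDouble < p).map(j => Edge(i, j, ()))
  }
}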

val edges = ids.mapPartitions(genEdgesForPartition)
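As an optional sanity check (again expensive, since it runs a full job): in the G(n, p) model each of the n(n-1)/2 candidate pairs is kept independently with probability p, so for large n the observed count should land near p * n * (n - 1) / 2:

val expectedEdges = p * n * (n - 1) / 2
println(s"generated ${edges.count} edges, expected ~$expectedEdges")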

Finally, we can create a graph:

val graph = Graph.fromEdges(edges, ())
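A couple of quick queries on the result. One caveat: Graph.fromEdges only materializes vertices that appear in at least one edge, so numVertices can be smaller than n if some vertices end up isolated:

println(graph.numVertices) // only vertices that occur in some edge
println(graph.numEdges)
// e.g. the maximum degree of the generated graph:
println(s"max degree: ${graph.degrees.values.max}")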
