
Creating edges based on vertex IDs in Spark (Scala)

I basically want to join two RDDs, vertices and edges. The vertices and edges are created using the following code:

// Tab-separated file with more than two columns; only the first two
// (source and destination URL) are relevant.
val file = sc.textFile("file.gz")

val edges = file.flatMap { f =>
  val urls = f.split("\t")
  if (urls.length >= 2) Some(urls(0) + "\t" + urls(1))
  else None
}.distinct

val vertices = edges.flatMap(f => f.split("\t")).distinct 
val vertices_zip = vertices.zipWithUniqueId

Now I have a list of vertices (URLs) with IDs generated by the above method, like:

google.de/2011/10/Extract-host,11
facebook.de/2014/11/photos,28         
community.cloudera.com/t5/,42         
facebook.de/2020/11/photos,91 

I would like to create edges based on these IDs. The edges RDD is tab-separated, as below:

google.de/2011/10/Extract-host   facebook.de/2014/11/photos   
facebook.de/2014/11/photos       community.cloudera.com/t5/
community.cloudera.com/t5/       google.de/2011/10/Extract-host

Required result:

11     28
28     42
42     11

I tried the following code:

val edges_id = edges.flatMap(line => line.split("\t")).map(line => (line, 0)).join(vertices_zip).map(x => x._2._2)

But I am not getting the desired result. Instead I get:

11
28
28
42
42
11

I am not sure how to join the edges with the vertices RDD to get this result. Any help would be much appreciated.

After calling zipWithUniqueId, collect the RDD as a map and then use that map to look up the IDs in the edges RDD, as follows:

val vertices_zip = vertices.zipWithUniqueId.collectAsMap

val edges_id = edges.map(f => {
  val urls = f.split("\t")
  vertices_zip(urls(0))+"\t"+vertices_zip(urls(1))
})
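
Note that vertices_zip(url) throws a NoSuchElementException if a URL is missing from the map. A minimal sketch of a more defensive lookup, assuming you would rather drop such edges than fail the job; toIdEdge and sampleIds are hypothetical names used only for illustration:

```scala
// Hypothetical helper (not part of the code above): converts one
// tab-separated "src\tdst" line into "srcId\tdstId", or returns None when
// the line has fewer than two columns or a URL has no ID in the map.
def toIdEdge(line: String, ids: collection.Map[String, Long]): Option[String] = {
  val urls = line.split("\t")
  if (urls.length < 2) None
  else
    for {
      src <- ids.get(urls(0))
      dst <- ids.get(urls(1))
    } yield s"$src\t$dst"
}

// Sample IDs taken from the question.
val sampleIds = Map(
  "google.de/2011/10/Extract-host" -> 11L,
  "facebook.de/2014/11/photos" -> 28L,
  "community.cloudera.com/t5/" -> 42L
)
```

With the collected map from above it could be used as edges.flatMap(line => toIdEdge(line, vertices_zip)), so that unresolved edges are silently skipped.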

That's all. I hope the answer is helpful.

Updated

You commented:

I am getting an exception: java.lang.OutOfMemoryError: Java heap space

For that you can use a broadcast variable, which ships the map to each executor once rather than with every task:

val vertices_zip = sc.broadcast(vertices.zipWithUniqueId.collectAsMap)

val edges_id = edges.map(f => {
  val urls = f.split("\t")
  vertices_zip.value(urls(0))+"\t"+vertices_zip.value(urls(1))
})

Joins

You commented again:

Is it possible to change the code I tried above to get the result (the one with the join)?

The join approach requires two joins, which means two shuffles are needed to get the desired result:

val vertices_zip = vertices.zipWithUniqueId

val edges_id = edges.map(line => {
  val splitted = line.split("\t")
  (splitted(0), splitted(1))
})
  .join(vertices_zip)
  .map(_._2)
  .join(vertices_zip)
  .map(x => x._2._1+"\t"+x._2._2)
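
To see why two joins are needed, it can help to trace the tuple shapes at each step. The sketch below uses plain Scala collections in place of the RDDs, with the sample data from the question; innerJoin is a hypothetical stand-in for the pair-RDD join and is not a Spark API:

```scala
// Plain Scala collections standing in for the RDDs (sample data from the question).
val edgePairs = Seq(
  ("google.de/2011/10/Extract-host", "facebook.de/2014/11/photos"),
  ("facebook.de/2014/11/photos", "community.cloudera.com/t5/"),
  ("community.cloudera.com/t5/", "google.de/2011/10/Extract-host")
)
val vertexIds = Seq(
  ("google.de/2011/10/Extract-host", 11L),
  ("facebook.de/2014/11/photos", 28L),
  ("community.cloudera.com/t5/", 42L)
)

// Hypothetical stand-in for an inner join on the first tuple element.
def innerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  left.flatMap { case (k, a) =>
    right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
  }

// The first join is keyed by the source URL: (src, (dst, srcId)).
val bySource = innerJoin(edgePairs, vertexIds)

// Dropping the key leaves (dst, srcId), so the second join is keyed by
// the destination URL: (dst, (srcId, dstId)).
val byDestination = innerJoin(bySource.map(_._2), vertexIds)

// Keep only the ID pair, formatted like the required result.
val idEdges = byDestination.map { case (_, (srcId, dstId)) => s"$srcId\t$dstId" }
```

Each join can only resolve the URL that is currently the key, which is why the source and the destination each need their own join. In Spark the order of the resulting records is not guaranteed.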
