
Cannot create graph in GraphX (Scala Spark)

I am having serious trouble creating a simple graph in Spark GraphX. I don't really understand what is going wrong, so I have tried everything I could find, but nothing works. For example, I tried to reproduce the steps from here.

The following two lines were OK:

val flightsFromTo = df_1.select($"Origin",$"Dest")

val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString))

But after this I get an error:

val airportVertices: RDD[(VertexId, String)] = airportCodes.distinct().map(x => (MurmurHash.stringHash(x), x))

Error: missing parameter type

Could you please tell me what is wrong?

And by the way, why MurmurHash? What is its purpose?

My guess is that you are working from a three-year-old tutorial with a recent Spark version. The sqlContext read returns a Dataset instead of an RDD. If you want it like the tutorial, use .rdd instead:

val airportVertices: RDD[(VertexId, String)] = airportCodes.rdd.distinct().map(x => (MurmurHash3.stringHash(x), x))

or change the type of the variable:

val airportVertices: Dataset[(Int, String)] = airportCodes.distinct().map(x => (MurmurHash3.stringHash(x), x))

You could also check out https://graphframes.github.io/ if you are interested in graphs and Spark.


Updated

To create a Graph you need vertices and edges. To make computation easier, all vertices have to be identified by a VertexId (in essence a Long).

MurmurHash is used to create very good, well-distributed hashes. More info here: MurmurHash - what is it?
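To illustrate what the hash buys you, here is a minimal plain-Scala sketch (the airport codes are made up) of turning string codes into numeric ids with MurmurHash3.stringHash, the same call the answer uses:

```scala
import scala.util.hashing.MurmurHash3

// MurmurHash3.stringHash maps a String to an Int deterministically,
// so the same airport code always gets the same numeric id.
val codes = Seq("SFO", "JFK", "LAX", "SFO")
val ids: Seq[(Long, String)] =
  codes.distinct.map(c => (MurmurHash3.stringHash(c).toLong, c))
```

The .toLong widening matters because GraphX's VertexId is a type alias for Long, while stringHash returns an Int.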

Hashing is a best practice to distribute the data without skew, but there is no technical reason why you couldn't use an incremental counter for each vertex.
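The counter alternative can be sketched like this; it is shown with plain Scala collections for brevity, but on an RDD you would call distinct() followed by zipWithIndex() the same way:

```scala
// Alternative to hashing: assign each distinct code an incremental id.
val airports = Seq("SFO", "JFK", "LAX")
val vertices: Seq[(Long, String)] =
  airports.distinct.zipWithIndex.map { case (code, i) => (i.toLong, code) }
```

Note that with a counter the id depends on the order the codes are seen, whereas a hash gives the same id for the same code regardless of ordering.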

I've looked at the tutorial, and the only thing you have to change to make it work is to add .rdd:

val flightsFromTo = df_1.select($"Origin",$"Dest").rdd
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString)).rdd
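From there, the remaining step in such tutorials is to turn the origin/destination pairs into edges keyed by the same hash as the vertices. A plain-Scala sketch of that step (FlightEdge is a stand-in for org.apache.spark.graphx.Edge, and the flight pairs are made up):

```scala
import scala.util.hashing.MurmurHash3

// Stand-in for org.apache.spark.graphx.Edge[Int]:
// an edge from srcId to dstId carrying an Int attribute.
case class FlightEdge(srcId: Long, dstId: Long, attr: Int)

val flights = Seq(("SFO", "JFK"), ("SFO", "LAX"))
val edges: Seq[FlightEdge] = flights.map { case (org, dst) =>
  // Key each endpoint with the same hash used for the vertices,
  // so edge ids line up with vertex ids.
  FlightEdge(MurmurHash3.stringHash(org).toLong,
             MurmurHash3.stringHash(dst).toLong, 1)
}
```

In Spark you would map over flightsFromTo the same way and then build the graph with Graph(airportVertices, edgeRDD).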
