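How to create a graph from an RDD/DF? Scala Spark
My RDD contains biological data: protein names and the similarity between them. I want to create a graph in which the vertices are proteins and the edges carry the similarity values. This is what my RDD looks like: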
+-------------+------------+------------+
| Protein1 | Protein2 | Similarity |
+-------------+------------+------------+
| P28469 | Q70UP5 | 0.11111111 |
| O45687 | P00325 | 1.0 |
| A7ME43 | Q5HG16 | 0.6 |
| A4VJT7 | Q9LD43 | 1.0 |
| P31937 | Q64415 | 0.07692308 |
| A1VAA0 | Q9L298 | 1.0 |
| B8DG74 | Q6MT35 | 1.0 |
+-------------+------------+------------+
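Thanks!
This is not the same data, but you need to do something like the following (reading from a file, of course) and adapt the approach to your data: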
import org.graphframes.GraphFrame
// Vertex DataFrame (GraphFrames requires an "id" column)
val v = sqlContext.createDataFrame(List(
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)
)).toDF("id", "name", "age")
// Edge DataFrame
val e = sqlContext.createDataFrame(List(
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
)).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)
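In your case:
// You mention an RDD, but the data looks like a DataFrame; let us assume an RDD of
// (protein1, protein2, similarity) tuples, simulated here instead of read from a file.
// toDF on an RDD needs the SQL implicits (spark-shell imports them automatically;
// otherwise import sqlContext.implicits._).
val rdd = sc.parallelize(Array(("p1", "p2", 1),
("p1", "p3", 2),
("p2", "p4", 3),
("p5", "p6", 4)))
// Vertices: every protein that appears on either side of an edge.
// distinct is needed because a protein can occur in several rows (p1 and p2 do here).
// GraphFrames requires the vertex column to be named "id".
val v = rdd.map(x => x._1).union(rdd.map(x => x._2)).distinct.toDF("id")
// Edges: GraphFrames requires the endpoint columns to be named "src" and "dst".
val e = rdd.toDF("src", "dst", "similarity")
v.show(false)
e.show(false)
val g = GraphFrame(v, e)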
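If the data is already a DataFrame, as the table in the question suggests, the same idea can be sketched without going through an RDD. This is a minimal sketch, not a definitive implementation. It assumes the columns are named Protein1, Protein2 and Similarity as in the table, that the graphframes package is on the classpath, and that a local SparkSession is acceptable (in spark-shell, `spark` already exists and the builder line can be dropped). The names `df`, `edges`, `vertices` and `graph` are chosen here for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("protein-graph").getOrCreate()
import spark.implicits._

// A few sample rows copied from the table in the question.
val df = Seq(
  ("P28469", "Q70UP5", 0.11111111),
  ("O45687", "P00325", 1.0),
  ("A7ME43", "Q5HG16", 0.6)
).toDF("Protein1", "Protein2", "Similarity")

// GraphFrames expects the edge endpoints in columns named "src" and "dst".
val edges = df.select(
  col("Protein1").as("src"),
  col("Protein2").as("dst"),
  col("Similarity").as("similarity")
)

// Every protein that appears on either side of an edge becomes a vertex.
// GraphFrames expects the vertex identifier in a column named "id".
val vertices = edges.select(col("src").as("id"))
  .union(edges.select(col("dst").as("id")))
  .distinct()

val graph = GraphFrame(vertices, edges)
graph.vertices.show(false)
graph.edges.show(false)
```

When the vertices carry no attributes of their own, `GraphFrame.fromEdges(edges)` should give an equivalent graph, since it derives the vertex set from the src and dst columns. Building the vertex DataFrame explicitly, as above, leaves room to join in extra protein attributes later.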