
How can I build a graph (using GraphX) from a Spark DataFrame?

I have created a Spark DataFrame and would like to build a graph from it with GraphX, Spark's graph API. My data looks like this:

+--------------------+----------------+------+
|           hotel_url|          author|rating|
+--------------------+----------------+------+
|Hotel_Review-g194...|    violettaf340|     5|
|Hotel_Review-g194...|       Lagaiuzza|     5|
|Hotel_Review-g194...|      ashleyn763|     5|
|Hotel_Review-g194...|     DavideMauro|     5|
|Hotel_Review-g194...|        Alemma11|     4|
|Hotel_Review-g194...|       ladispoli|     4|
|Hotel_Review-g303...|       LiliT0URS|     3|
|Hotel_Review-g303...|     Amandainldn|     4|
|Hotel_Review-g303...|TwoMonkeysTravel|     5|
|Hotel_Review-g303...|     BiancaB3358|     4|
|Hotel_Review-g303...|    Brett-Sweden|     4|
|Hotel_Review-g303...|      analuizade|     5|
|Hotel_Review-g303...|          heckfy|     5|
|Hotel_Review-g303...|  MatheusMedrado|     3|
|Hotel_Review-g303...|TwoMonkeysTravel|     5|
|Hotel_Review-g303...|          SaStar|     4|
|Hotel_Review-g303...|   chrisbG2838DY|     4|
|Hotel_Review-g303...|        virninha|     5|
|Hotel_Review-g303...|    AugustusC_13|     5|
|Hotel_Review-g303...|         AnnaMir|     5|
+--------------------+----------------+------+

I would like to know how to create a graph from this DataFrame with relationships of the form [ (Node: hotel_url) --- (weight: rating) --- (Node: author) ].

The desired relationship is also illustrated in the accompanying figure.


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// `spark` is the active SparkSession (predefined in spark-shell)
import spark.implicits._

val data = List(
  ("Hotel_Review-g194...", "violettaf340", 5),
  ("Hotel_Review-g194...", "Lagaiuzza", 5),
  ("Hotel_Review-g194...", "ashleyn763", 5),
  ("Hotel_Review-g194...", "DavideMauro", 5),
  ("Hotel_Review-g194...", "Alemma11", 4),
  ("Hotel_Review-g194...", "ladispoli", 4),
  ("Hotel_Review-g303...", "LiliT0URS", 3),
  ("Hotel_Review-g303...", "Amandainldn", 4),
  ("Hotel_Review-g303...", "TwoMonkeysTravel", 5),
  ("Hotel_Review-g303...", "BiancaB3358", 4),
  ("Hotel_Review-g303...", "Brett-Sweden", 4),
  ("Hotel_Review-g303...", "analuizade", 5),
  ("Hotel_Review-g303...", "heckfy", 5),
  ("Hotel_Review-g303...", "MatheusMedrado", 3),
  ("Hotel_Review-g303...", "TwoMonkeysTravel", 5),
  ("Hotel_Review-g303...", "SaStar", 4),
  ("Hotel_Review-g303...", "chrisbG2838DY", 4),
  ("Hotel_Review-g303...", "virninha", 5),
  ("Hotel_Review-g303...", "AugustusC_13", 5),
  ("Hotel_Review-g303...", "AnnaMir", 5)
).toDF("hotel_url", "author", "rating")

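// Collect every distinct hotel_url and author as a node name,
// then pair each name with a unique Long id to use as the GraphX VertexId.
val vertices: RDD[(VertexId, String)] = data
  .select(explode(array(col("hotel_url"), col("author"))))
  .dropDuplicates()
  .rdd
  .map(_.getAs[String](0))
  .zipWithIndex
  .map(_.swap)

// The same (id, node) pairs as a DataFrame, used below to look up ids by name
val vertDF = vertices.toDF("id", "node")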

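// The first join resolves the hotel's id (kept as idS), the second the author's id (id).
// Each review row becomes one edge from hotel to author, labelled with its rating.
val edges = data
  .join(vertDF, data.col("hotel_url") === vertDF("node"))
  .select('author, 'rating.cast(StringType), 'id as 'idS)
  .join(vertDF, data("author") === vertDF("node"))
  .rdd
  .map(row =>
    Edge(
      row.getAs[Long]("idS"),
      row.getAs[Long]("id"),
      "rating: " + row.getAs[String]("rating")
    )
  )

// Vertex attribute: node name (String); edge attribute: rating label (String)
val graph = Graph(vertices, edges)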

graph.vertices.foreach(println _)
//    (2,Amandainldn)
//    (7,heckfy)
//    (5,DavideMauro)
//    (0,MatheusMedrado)
//    (4,ashleyn763)
//    (1,LiliT0URS)
//    (3,chrisbG2838DY)
//    (9,Brett-Sweden)
//    (11,virninha)
//    (12,BiancaB3358)
//    (16,AnnaMir)
//    (10,TwoMonkeysTravel)
//    (6,SaStar)
//    (17,AugustusC_13)
//    (19,ladispoli)
//    (20,Alemma11)
//    (14,analuizade)
//    (8,Lagaiuzza)
//    (18,violettaf340)
//    (15,Hotel_Review-g194...)
//    (13,Hotel_Review-g303...)

graph.edges.foreach(println(_))
//    Edge(13,0,rating: 3)
//    Edge(13,1,rating: 3)
//    Edge(13,3,rating: 4)
//    Edge(15,4,rating: 5)
//    Edge(13,2,rating: 4)
//    Edge(13,12,rating: 4)
//    Edge(13,10,rating: 5)
//    Edge(13,10,rating: 5)
//    Edge(15,8,rating: 5)
//    Edge(15,5,rating: 5)
//    Edge(13,9,rating: 4)
//    Edge(13,6,rating: 4)
//    Edge(13,11,rating: 5)
//    Edge(15,18,rating: 5)
//    Edge(13,14,rating: 5)
//    Edge(13,16,rating: 5)
//    Edge(15,19,rating: 4)
//    Edge(13,17,rating: 5)
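
Since the question asks for the rating as an edge weight, a variant that keeps the rating as an Int edge attribute (instead of a "rating: n" string) may be easier to use in later computations. The following is only a minimal sketch under a few assumptions: it builds its own local SparkSession, uses a three-row sample of the data above, and the names `reviews`, `names`, `hotelIds`, `authorIds`, `weightedEdges` and `weightedGraph` are hypothetical, chosen for illustration. The id RDD is cached so that the ids produced by zipWithIndex are less likely to be recomputed in a different order between the lookup join and the graph construction.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-weighted-sketch")
  .getOrCreate()
import spark.implicits._

// A small sample with the same columns as the DataFrame in the question
val reviews = Seq(
  ("Hotel_Review-g194...", "violettaf340", 5),
  ("Hotel_Review-g303...", "LiliT0URS", 3),
  ("Hotel_Review-g303...", "TwoMonkeysTravel", 5)
).toDF("hotel_url", "author", "rating")

// One Long id per distinct node name (hotels and authors share one id space)
val names: RDD[(VertexId, String)] = reviews
  .select(col("hotel_url"))
  .union(reviews.select(col("author")))
  .distinct()
  .as[String]
  .rdd
  .zipWithIndex()
  .map(_.swap)
  .cache()

// Two renamed views of the id table, so each join has unambiguous column names
val ids = names.toDF("id", "node")
val hotelIds = ids.select(col("id").as("srcId"), col("node").as("hotel_url"))
val authorIds = ids.select(col("id").as("dstId"), col("node").as("author"))

// One edge per review row: hotel -> author, with the numeric rating as the weight
val weightedEdges: RDD[Edge[Int]] = reviews
  .join(hotelIds, "hotel_url")
  .join(authorIds, "author")
  .select("srcId", "dstId", "rating")
  .as[(Long, Long, Int)]
  .rdd
  .map { case (src, dst, rating) => Edge(src, dst, rating) }

val weightedGraph: Graph[String, Int] = Graph(names, weightedEdges)

// Print each relationship as "hotel --(rating)--> author"
weightedGraph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} --(${t.attr})--> ${t.dstAttr}")
}
```

With an Int attribute, the weights can be used directly in GraphX operations such as `aggregateMessages` or `mapEdges` without parsing a string first. Duplicate reviews (the same hotel and author appearing twice, as with TwoMonkeysTravel in the full data) still produce parallel edges, as in the answer above.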
