Sum the Distance in Apache-Spark dataframes
The following code produces a DataFrame in which each column holds three values, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
("1", "Al"),
("2", "B"),
("3", "C"),
("4", "D"),
("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
("1", "3", 5),
("1", "2", 8),
("2", "3", 6),
("2", "4", 7),
("2", "1", 8),
("3", "1", 5),
("3", "2", 6),
("4", "2", 7),
("4", "5", 8),
("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
val df=paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)) : _*).show
The output of the above code is given below:
+-------+-------+-------+
| e0| e1| e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output, each column has three values, which can be interpreted as follows.
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically, e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add the distance of each edge to get the total distance. How can I achieve this?
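It can be done like this:
import org.apache.spark.sql.functions.col

// Build one Column that adds the "property" field of every edge column (e0, e1, ...).
val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)
df.withColumn("total", total)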
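To see what that expression amounts to, here is a minimal plain-Scala sketch with no Spark dependency. It assumes the BFS result has the columns from, e0, v1, e1, v2, e2, to (an assumption; the question only shows e0 to e2), and it uses plain strings as stand-ins for Spark Column objects, so `edgeFields`, `expression` and `strictEdges` are illustrative names only.

```scala
// Assumed column layout of the BFS result (hypothetical; only e0..e2 appear in the question).
val columns: Seq[String] = Seq("from", "e0", "v1", "e1", "v2", "e2", "to")

// Keep the edge columns and point at their "property" field, as the answer does with col(...).
val edgeFields: Seq[String] = columns.filter(_.startsWith("e")).map(c => s"$c.property")

// reduce combines the fields pairwise from the left, which is how the sum expression is built.
val expression: String = edgeFields.reduce((a, b) => s"($a + $b)")

// A stricter filter: only names of the form e<digits>, in case other columns start with "e".
val strictEdges: Seq[String] = columns.filter(_.matches("e\\d+"))

println(expression)
println(strictEdges.mkString(", "))
```

The prefix filter would also pick up any other column whose name starts with "e", so the stricter pattern may be safer if the DataFrame carries such columns.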
I would make a collection of the columns to sum and then use a foldLeft on a UDF (the example models each edge as an array of ints and adds up the last element of each):
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
| e0| e1| e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
| e0| e1| e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]| 23|
+---------+---------+---------+----+
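The two answers differ mainly in how they combine the per-edge values: reduce needs at least one edge column, while foldLeft starts from an explicit zero. A rough plain-Scala sketch of that difference, using ordinary collections in place of DataFrame columns (`row` is just the sample data from above, not a Spark API):

```scala
import scala.util.Try

// One row of the sample data: each edge is (src, dst, distance).
val row: Seq[Seq[Int]] = Seq(Seq(1, 2, 8), Seq(2, 4, 7), Seq(4, 5, 8))

// foldLeft mirrors the UDF approach: start at 0 and add the last element of each edge.
val viaFold: Int = row.foldLeft(0)((acc, edge) => acc + edge.last)

// reduce mirrors the first answer: extract the distances, then add them pairwise.
val viaReduce: Int = row.map(_.last).reduce(_ + _)

// With no edges, foldLeft returns its start value, whereas reduce throws.
val noEdges: Seq[Seq[Int]] = Seq.empty
val emptyFold: Int = noEdges.foldLeft(0)((acc, edge) => acc + edge.last)
val emptyReduce: Try[Int] = Try(noEdges.map(_.last).reduce(_ + _))

println(s"fold=$viaFold reduce=$viaReduce emptyFold=$emptyFold emptyReduceFailed=${emptyReduce.isFailure}")
```

If the set of edge columns could ever be empty, the foldLeft form (or a guard before reduce) avoids the exception.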