Spark sum columns from different dataframes

We have two dataframes (Scala syntax, for illustration):

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")

val df2 = sc.parallelize(2 to 4).map(i => (i,i*100)).toDF("id","y") 

How can we sum one column from each frame to obtain this new dataframe?

| id| x_plus_y|
|  1|       10|
|  2|      220|
|  3|      330|
|  4|      440|

Note: I tried this, but it yields null in the first row, because id 1 has no match in df2:

sqlContext.sql("select df1.id, x+y as x_plus_y from df1 left join df2 on df1.id=df2.id").show
| id|x_plus_y|
|  1|    null|
|  2|     220|
|  3|     330|
|  4|     440|
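
The null in the first row follows from SQL null propagation: after the left join, y is null for id 1, and x + null evaluates to null. One possible fix is to wrap y in coalesce inside the query itself. This is a minimal sketch, assuming a spark-shell session on the same Spark 1.x API as the question (so `sc`, `sqlContext` and the implicits behind `toDF` are already in scope) and assuming the frames are registered as temp tables named df1 and df2; the name `summed` is only illustrative.

```scala
// Assumption: running inside spark-shell (Spark 1.x), where `sc`,
// `sqlContext` and the implicits needed by toDF are predefined.
val df1 = sc.parallelize(1 to 4).map(i => (i, i * 10)).toDF("id", "x")
val df2 = sc.parallelize(2 to 4).map(i => (i, i * 100)).toDF("id", "y")

// Register both frames so the SQL text can refer to them by name.
df1.registerTempTable("df1")
df2.registerTempTable("df2")

// coalesce(y, 0) substitutes 0 where the left join found no match,
// so the sum is no longer null for id 1.
val summed = sqlContext.sql("""
  select df1.id, x + coalesce(y, 0) as x_plus_y
  from df1 left join df2 on df1.id = df2.id""")

// Rows (order not guaranteed): (1, 10), (2, 220), (3, 330), (4, 440)
summed.show
```
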
df3 = df1.join(df2, df1.id == df2.id, "left_outer").select(df1.id, df1.x, df2.y).fillna(0)
df3.select("id", (df3.x + df3.y).alias("x_plus_y")).show()

This works in Python.

You don't even need a UDF for that:

val df3 = df1.as('a).join(df2.as('b), $"a.id" === $"b.id","left").
               select(df1("id"),'x,'y,(coalesce('x, lit(0)) + coalesce('y, lit(0))).alias("x_plus_y")).na.fill(0)

// df3: org.apache.spark.sql.DataFrame = [id: int, x: int, y: int, x_plus_y: int]
// +---+---+---+--------+
// | id|  x|  y|x_plus_y|
// +---+---+---+--------+
// |  1| 10|  0|      10|
// |  2| 20|200|     220|
// |  3| 30|300|     330|
// |  4| 40|400|     440|
// +---+---+---+--------+

In Scala, I found this solution:

val d = sqlContext.sql("""
  select df1.id, x, y from df1 left join df2 on df1.id=df2.id""").na.fill(0)

This joins the frames and replaces missing values with zeroes. Then define this UDF:

// Brings udf (and the other built-in column functions) into scope.
import org.apache.spark.sql.functions._

val plus: (Int,Int) => Int = (x:Int,y:Int) => x+y
val plus_udf = udf(plus)

d.withColumn("x_plus_y", plus_udf($"x", $"y")).show
| id|  x|  y|x_plus_y|
|  1| 10|  0|      10|
|  2| 20|200|     220|
|  3| 30|300|     330|
|  4| 40|400|     440|
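
Once the nulls are filled, x and y are plain integer columns, so the built-in `+` operator on columns should give the same result without defining a UDF. The following is a minimal sketch under that assumption, again for a spark-shell session on Spark 1.x; it rebuilds the joined frame with the DataFrame API rather than SQL so that no temp tables are needed, and the name `result` is only illustrative.

```scala
// Assumption: running inside spark-shell (Spark 1.x), where `sc`
// and the implicits needed by toDF are predefined.
val df1 = sc.parallelize(1 to 4).map(i => (i, i * 10)).toDF("id", "x")
val df2 = sc.parallelize(2 to 4).map(i => (i, i * 100)).toDF("id", "y")

// Left join on id, keep one id column, and replace missing y with 0.
val d = df1.join(df2, df1("id") === df2("id"), "left_outer")
  .select(df1("id"), df1("x"), df2("y"))
  .na.fill(0)

// Column addition replaces the plus_udf call.
val result = d.withColumn("x_plus_y", d("x") + d("y"))

// Rows (order not guaranteed): (1, 10, 0, 10), (2, 20, 200, 220),
// (3, 30, 300, 330), (4, 40, 400, 440)
result.show
```

Built-in column expressions are visible to Spark's query optimizer, whereas a UDF is opaque to it, which is one reason to prefer them for simple arithmetic.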
