
How to join two data frames in Apache Spark and merge keys into one column?

I have the following two Spark data frames:

sale_df:

+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+

and target_df:

+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+

How can I join them so that the output is:

user_id   total_sale   personalized_target
 a           1100            NA
 b           2100            1000
 c           3300            2000
 d           4400            3000
 e           NA              4000

I have tried almost all of the join types, but it seems that a single join cannot produce the desired output.

A solution using PySpark, SQL, or HiveContext would help.

You can use the equi-join syntax in Scala:

  // sale_df and target_df are the two data frames from the question
  val output = sale_df.join(target_df, Seq("user_id"), joinType = "outer")

The same should work in Python, though you should check:

   # sale_df and target_df are the two data frames from the question
   output = sale_df.join(target_df, ['user_id'], "outer")

You need to perform an outer equi-join:

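# Sample data matching the two frames in the question.
# sqlContext is assumed to be predefined, as in the PySpark shell.
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1, ['user_id', 'total_sale'])
data2 = [['b', 1000], ['c', 2000], ['d', 3000], ['e', 4000]]
target = sqlContext.createDataFrame(data2, ['user_id', 'personalized_target'])

# Joining on the column name (not an expression) yields a single user_id column.
sales.join(target, 'user_id', "outer").show()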
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+
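
To see why a single outer equi-join is enough, here is a rough plain-Python sketch of what the join computes, with no Spark involved. The helper name `full_outer_join` is hypothetical and not part of any Spark API, and the sketch assumes each `user_id` appears at most once per side. The result has one row per key found on either side, and the missing side is filled with `None` (shown as `null` in the Spark output).

```python
def full_outer_join(left, right):
    """Full outer equi-join of two {key: value} mappings.

    Returns a list of (key, left_value, right_value) rows, one per key
    that appears in either mapping. A missing side is None.
    """
    # The merged key column is the union of the keys from both sides.
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]


# Same sample data as in the question, keyed by user_id.
sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}

for row in full_outer_join(sales, target):
    print(row)
```

The rows are sorted here only to make the output stable. Spark itself does not guarantee any row order unless you sort the result explicitly.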
