[英]Advanced join two dataframe spark scala
I have to join two Dataframes. 我必须加入两个数据框。
Sample: Dataframe1 looks like this 示例:Dataframe1看起来像这样
df1_col1 df1_col2
a ex1
b ex4
c ex2
d ex6
e ex3
Dataframe2 Dataframe2
df2_col1 df2_col2
1 a,b,c
2 d,c,e
3 a,e,c
In result Dataframe I would like to get result like this 结果数据框我想得到这样的结果
res_col1 res_col2 res_col3
a ex1 1
a ex1 3
b ex4 1
c ex2 1
c ex2 2
c ex2 3
d ex6 2
e ex3 2
e ex3 3
What will be the best way to achieve this join? 实现这种加入的最佳方法是什么?
I have updated the code below 我已经更新了下面的代码
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3")))
val df2 = sc.parallelize(Seq(List(("1","a,b,c"),("2","d,c,e")))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ","))).select($"_1".as("df2_col1"),$"df2_col2_explode").join(df1.select($"_1".as("df1_col1"),$"_2".as("df1_col2")), $"df1_col1"===$"df2_col2_explode","inner").show
You just need to split the values and generate multiple rows by exploding it and then join with the other dataframe. 您只需要拆分这些值并通过展开它来生成多行,然后与另一个数据框合并。
You can refer this link, How to split pipe-separated column into multiple rows? 您可以参考此链接, 如何将以管道分隔的列拆分为多行?
I used spark sql for this join, here is a part of code; 我为连接使用了spark sql,这是代码的一部分;
df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select
| b.df1_col1 as res_col1,
| b.df1_col2 as res_col2,
| a.df2_col1 as res_col3
| from (select df2_col1, exp_col
| from temp_v_df2
| lateral view explode(split(df2_col2,",")) dummy as exp_col) a
| join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)
I used spark scala data frame to achieve your desire output. 我使用spark scala数据框来实现所需的输出。
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2")
val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2")
df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp").join (df1,$"temp._tmp"===df1("df1_col1"),"inner").drop("_tmp","df2_col2").show
Desire Output
需求输出
+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
| 2| e| ex3|
| 3| e| ex3|
| 2| d| ex6|
| 1| c| ex2|
| 2| c| ex2|
| 3| c| ex2|
| 1| b| ex4|
| 1| a| ex1|
| 3| a| ex1|
+--------+--------+--------+
Rename the Column according to your requirement. 根据您的要求重命名列。
Here the screenshot of running code 这里是运行代码的屏幕截图
Happy Hadoooooooooooooooppppppppppppppppppp
快乐Hadoooooooooooooooppppppppppppppppppppp
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.