
Advanced join two dataframe spark scala

I have to join two DataFrames.

Sample: Dataframe1 looks like this:

df1_col1      df1_col2
   a            ex1
   b            ex4
   c            ex2
   d            ex6
   e            ex3

Dataframe2:

df2_col1      df2_col2
   1           a,b,c
   2           d,c,e
   3           a,e,c

In the result DataFrame I would like to get this:

res_col1      res_col2       res_col3
    a           ex1             1
    a           ex1             3
    b           ex4             1
    c           ex2             1
    c           ex2             2
    c           ex2             3
    d           ex6             2
    e           ex3             2
    e           ex3             3

What would be the best way to achieve this join?

I have updated the code below:

val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF
val df2 = sc.parallelize(Seq(("1","a,b,c"),("2","d,c,e"),("3","a,e,c"))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ",")))
  .select($"_1".as("df2_col1"), $"df2_col2_explode")
  .join(df1.select($"_1".as("df1_col1"), $"_2".as("df1_col2")),
    $"df1_col1" === $"df2_col2_explode", "inner")
  .show

You just need to split the values, generate multiple rows by exploding them, and then join with the other DataFrame.

You can refer to this link: How to split a pipe-separated column into multiple rows?
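To make the explode-then-join idea concrete without a Spark session, here is a minimal plain-Scala sketch using ordinary collections in place of DataFrames (the names `df1`, `df2`, `exploded`, and `result` are illustrative, not Spark API):

```scala
// Plain-Scala sketch of the explode-then-join logic, with ordinary
// collections standing in for the DataFrames in the question.
// df1: (df1_col1 -> df1_col2), kept as a Map for the lookup side of the join
val df1 = Map("a" -> "ex1", "b" -> "ex4", "c" -> "ex2", "d" -> "ex6", "e" -> "ex3")

// df2: (df2_col1, df2_col2) where df2_col2 is a comma-separated list
val df2 = Seq(1 -> "a,b,c", 2 -> "d,c,e", 3 -> "a,e,c")

// "explode": emit one (key, id) row per comma-separated element
val exploded = df2.flatMap { case (id, csv) => csv.split(",").map(k => (k, id)) }

// "join": attach df1_col2 for every exploded key that exists in df1
val result = exploded
  .collect { case (k, id) if df1.contains(k) => (k, df1(k), id) }
  .sortBy { case (k, _, id) => (k, id) }

result.foreach(println)
```

The `result` rows match the expected output in the question: one `(res_col1, res_col2, res_col3)` tuple per membership of a `df1` key in a `df2` list.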

I used Spark SQL for this join; here is part of the code:

df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select 
                    |   b.df1_col1 as res_col1, 
                    |   b.df1_col2 as res_col2, 
                    |   a.df2_col1 as res_col3  
                    |   from (select df2_col1, exp_col 
                    |         from temp_v_df2 
                    |        lateral view explode(split(df2_col2,",")) dummy as exp_col) a
                    |   join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)

I used the Spark Scala DataFrame API to achieve your desired output.

val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2") 

val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2") 

df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp")
  .join(df1, $"temp._tmp" === df1("df1_col1"), "inner")
  .drop("_tmp", "df2_col2")
  .show

Desired output:

+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
|       2|       e|     ex3|
|       3|       e|     ex3|
|       2|       d|     ex6|
|       1|       c|     ex2|
|       2|       c|     ex2|
|       3|       c|     ex2|
|       1|       b|     ex4|
|       1|       a|     ex1|
|       3|       a|     ex1|
+--------+--------+--------+

Rename the columns according to your requirements.

Here is a screenshot of the running code.

Happy Hadoooooooooooooooppppppppppppppppppp
