简体   繁体   中英

How would I merge these two dataframes to produce the third dataframe in Spark Scala?

I'm having difficulty joining these two dataframe views because of not being able to modify specific column values in spark scala. I think I have to do a transpose/join somehow, but am not able to figure it out.

Here is the first dataframe:

  var sample_df = Seq(("john","morning","7am"),("john","night","10pm"),("bob","morning","8am"),("bob","night","11pm"),("phil","morning","9am"),("phil","night","10pm")).toDF("person","time_of_day","wake/sleep hour")

在此处输入图像描述

here is the second dataframe:

  var sample_df2 = Seq(("john","6am","11pm"),("bob","7am","2am"),("phil","8am","1am")).toDF("person","morning_earliest","night_latest")

在此处输入图像描述

and here is the resulting dataframe I'm looking to produce:

  var resulting_df = Seq(("john","morning","7am","6am"),("john","night","10pm","11pm"),("bob","morning","8am","7am"),("bob","night","11pm","2am"),("phil","morning","9am","8am"),("phil","night","10pm","1am")).toDF("person","time_of_day","wake/sleep hour","earliest/latest")

在此处输入图像描述

Any help would be greatly appreciated! Thanks and have a great day!

sample_df.createOrReplaceTempView("df1")
sample_df2.createOrReplaceTempView("df2")

spark.sql("""
select person, time_of_day, `wake/sleep hour`, `earliest/latest`
from (
    select person, stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)
    from df2
) df
join df1
using (time_of_day, person)
""").show()

+------+-----------+---------------+---------------+
|person|time_of_day|wake/sleep hour|earliest/latest|
+------+-----------+---------------+---------------+
|  john|    morning|            7am|            6am|
|  john|      night|           10pm|           11pm|
|   bob|    morning|            8am|            7am|
|   bob|      night|           11pm|            2am|
|  phil|    morning|            9am|            8am|
|  phil|      night|           10pm|            1am|
+------+-----------+---------------+---------------+
val df = sample_df .join(sample_df2,"person") val resulting_df = df.withColumn("earliest/latest", when(col("time_of_day")=== "morning", $"morning_earliest") .otherwise($"night_latest")) .drop($"morning_earliest") .drop($"night_latest") resulting_df.show()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM