
Inner Join Spark (Scala)

I am trying to achieve SCD Type 1-style functionality with Spark dataframes but am not getting the desired outcome. I am a beginner to Spark.

Scenario - I have two dataframes, SRC (source data) and TGT (target data), joined on join_key (Account_nbr, Location_cd). They each look like:

SRC_DF - (fresh data received from the source on the current day):

Account_nbr|Location_cd|State_IN|REF_IN
1234567|1000|A|Y
3456789|2000|I|N
6789123|5000|A|Y

TGT_DF - (2 of the above accounts are already present in the target):

DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN
900000|1234567|1000|I|N
900001|3456789|2000|A|Y

Here is what I tried to run and the outcome I got:

val join_output = TGT_DF.join(SRC_DF, Seq("Account_nbr", "Location_cd"))

DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN|State_IN|REF_IN
900000|1234567|1000|I|N|A|Y
900001|3456789|2000|A|Y|I|N

Question 1 - How can I suppress State_IN and REF_IN from TGT_DF in the output and get the desired output below?

DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN
900000|1234567|1000|A|Y - (Type 1 update)
900001|3456789|2000|I|N - (Type 1 update)
900002|6789123|5000|A|Y - (New Insert-1st Occurance)

Question 2 - What's the best way to generate new DIM_IDs for new inserts (continuing from the existing max(DIM_ID) in the target)?

Also, I want this logic to be generic (to be used for other tables as well), driven by three parameters - (src, tgt, join_key) - or more if required.

Thanks, Sid

The ideal function you desire would require a join, selecting the required fields, separating the joined dataframe into rows with a valid DIM_ID and rows with a null DIM_ID, populating the null DIM_IDs starting from the max DIM_ID, updating the REF_IN column, and finally merging the two separated dataframes.

The theory above can be programmed as below (I have commented for clarification, and you can make it more robust if you desire):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def func(src: DataFrame, trgt: DataFrame, join_key: Array[String], select_keys: Array[String]): DataFrame = {
  // join on the key columns and keep the target's DIM_ID plus the source's (fresher) attributes
  val select_columns = Array("trgt.DIM_ID") ++ select_keys.map(x => "src." + x)
  val joined_df = src.as("src").join(trgt.as("trgt"), Seq(join_key: _*), "left")
    .select(select_columns.map(col): _*)

  // separate the joined dataframe: rows with a DIM_ID already exist in the target (Type 1 updates),
  // rows with a null DIM_ID are new inserts
  val matched_df = joined_df.filter(col("DIM_ID").isNotNull)
  val not_matched_df = joined_df.filter(col("DIM_ID").isNull)

  // extract the existing max DIM_ID from the target so new rows can be numbered after it
  val max_dim_id = trgt.select(max("DIM_ID")).take(1)(0).getAs[Int](0)

  // generate DIM_IDs for the new rows, counting up from the max DIM_ID
  // (an unpartitioned window pulls all new rows into one partition, which is expensive)
  val not_matched_df_with_id = not_matched_df
    .withColumn("DIM_ID", row_number().over(Window.orderBy(join_key.head)) + max_dim_id)

  // merge both separated dataframes, with the REF_IN column modified according to your desired output
  matched_df.withColumn("REF_IN", concat_ws(" - ", col("REF_IN"), lit("(Type 1 update)")))
    .union(not_matched_df_with_id.withColumn("REF_IN", concat_ws(" - ", col("REF_IN"), lit("(New Insert-1st Occurance)"))))
}

Finally, you call the function as

val select_columns = SRC_DF.columns
func(SRC_DF, TGT_DF, Array("Account_nbr","Location_cd"), select_columns)
  .show(false)

which should give you your desired output dataframe:

+------+-----------+-----------+--------+------------------------------+
|DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN                        |
+------+-----------+-----------+--------+------------------------------+
|900000|1234567    |1000       |A       |Y - (Type 1 update)           |
|900001|3456789    |2000       |I       |N - (Type 1 update)           |
|900002|6789123    |5000       |A       |Y - (New Insert-1st Occurance)|
+------+-----------+-----------+--------+------------------------------+

Regarding Ramesh Maharjan's answer: if you find that operation (row_number()) expensive, try the better option below.

joined_df.withColumn("DIM_ID", coalesce($"DIM_ID", lit(max_dim_id) + lit(1) + monotonically_increasing_id()))
// monotonically_increasing_id() starts generating numbers from 0, hence adding lit(1) on top of max_dim_id
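
One caveat not stated above: monotonically_increasing_id() only guarantees increasing, unique values, not consecutive ones - with more than one partition the generated values jump, so the resulting DIM_IDs will have gaps. If strictly contiguous IDs matter, a zipWithIndex-based sketch along these lines is a common workaround (this is my illustration, not code from the answers; it assumes the SparkSession spark and the not_matched_df / max_dim_id values from the function above are in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// prepend a non-null DIM_ID column to the schema of the rows that need new IDs
val schema = StructType(
  StructField("DIM_ID", LongType, nullable = false) +:
  not_matched_df.drop("DIM_ID").schema.fields
)

// zipWithIndex assigns 0-based, strictly contiguous indices across partitions
val withIds = not_matched_df.drop("DIM_ID").rdd
  .zipWithIndex()
  .map { case (row, idx) => Row.fromSeq((max_dim_id + idx + 1L) +: row.toSeq) }

val not_matched_df_with_id = spark.createDataFrame(withIds, schema)

zipWithIndex triggers an extra Spark job to compute partition sizes, but it avoids pulling all new rows into a single partition the way an unpartitioned row_number() window does.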
