I am trying to achieve Type 1 (overwrite) dimension functionality with Spark DataFrames but am not getting the desired outcome. I am a beginner with Spark.
Scenario: I have two DataFrames, SRC (source data) and TGT (target data), joined on the key (Account_nbr, Location_cd). They look like this:
SRC_DF (fresh data received from the source on the current day):
Account_nbr|Location_cd|State_IN|REF_IN
1234567|1000|A|Y
3456789|2000|I|N
6789123|5000|A|Y
TGT_DF (two of the above accounts are already present in the target):
DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN
900000|1234567|1000|I|N
900001|3456789|2000|A|Y
Here is what I tried to run, and the output it produces (expected for a plain join, but not what I want):
val join_output = TGT_DF.join(SRC_DF, Seq("Account_nbr", "Location_cd"))
DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN|State_IN|REF_IN
900000|1234567|1000|I|N|A|Y
900001|3456789|2000|A|Y|I|N
Question 1: How can I suppress TGT_DF's State_IN and REF_IN columns in the output and get the desired output below?
DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN
900000|1234567|1000|A|Y - (Type 1 update)
900001|3456789|2000|I|N - (Type 1 update)
900002|6789123|5000|A|Y - (New Insert-1st Occurrence)
Question 2: What is the best way to generate the new DIM_IDs for the new inserts (continuing from the existing max(DIM_ID) in the target)?
Also, I want this logic to be generic (reusable for other tables as well), driven by three parameters (src, tgt, join_key), or more if required.
Thanks, Sid
The function you want needs to: join the two DataFrames, select the required fields, split the joined DataFrame into rows with a valid DIM_ID and rows with a null DIM_ID, populate the null DIM_IDs starting from the max DIM_ID, update the REF_IN column, and finally merge the two parts back together.
That outline can be programmed as below (I have added comments for clarification; you can make it more robust if you wish):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

def func(src: DataFrame, trgt: DataFrame, join_key: Array[String], select_keys: Array[String]): DataFrame = {
  // join on the key columns; the left join keeps every source row, with a
  // null DIM_ID for rows not yet present in the target
  val select_columns = Array("trgt.DIM_ID") ++ select_keys.map(x => "src." + x)
  val joined_df = src.as("src").join(trgt.as("trgt"), join_key.toSeq, "left")
    .select(select_columns.map(col): _*)
  // split the joined DataFrame so the null DIM_IDs can be populated separately
  val matched_df = joined_df.filter(col("DIM_ID").isNotNull)
  val not_matched_df = joined_df.filter(col("DIM_ID").isNull)
  // take the max DIM_ID over the whole target (not just the matched rows),
  // so new IDs continue from the true existing maximum; assumes a non-empty target
  val max_dim_id = trgt.agg(max("DIM_ID")).head.getAs[Int](0)
  // generate DIM_IDs for the new rows, continuing from the max DIM_ID;
  // ordering by the join key columns keeps the function generic across tables
  val not_matched_df_with_id = not_matched_df.withColumn(
    "DIM_ID", row_number().over(Window.orderBy(join_key.map(col): _*)) + max_dim_id)
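  // NOTE: a Window with orderBy but no partitionBy pulls every unmatched row
  // into a single partition (Spark logs a warning about this); fine for a
  // small daily delta, but a potential bottleneck at scale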
  // merge the two parts back together and return, tagging REF_IN to show
  // which path each row took
  matched_df.withColumn("REF_IN", concat_ws(" - ", col("REF_IN"), lit("(Type 1 update)")))
    .union(not_matched_df_with_id.withColumn("REF_IN",
      concat_ws(" - ", col("REF_IN"), lit("(New Insert-1st Occurrence)"))))
}
Finally, you call the function as follows:
val select_columns = SRC_DF.columns
func(SRC_DF, TGT_DF, Array("Account_nbr","Location_cd"), select_columns)
.show(false)
This should give you the desired output DataFrame:
+------+-----------+-----------+--------+-------------------------------+
|DIM_ID|Account_nbr|Location_cd|State_IN|REF_IN                         |
+------+-----------+-----------+--------+-------------------------------+
|900000|1234567    |1000       |A       |Y - (Type 1 update)            |
|900001|3456789    |2000       |I       |N - (Type 1 update)            |
|900002|6789123    |5000       |A       |Y - (New Insert-1st Occurrence)|
+------+-----------+-----------+--------+-------------------------------+
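One note on the output: the " - (Type 1 update)" and " - (New Insert-1st Occurrence)" suffixes come from the two concat_ws calls and only mirror the annotations in the question; drop those calls if you want the plain REF_IN values.
Because the function depends only on its parameters (plus the fixed DIM_ID column name), it can be pointed at any other source/target pair. A minimal usage sketch, where CUST_SRC_DF, CUST_TGT_DF, and Customer_id are hypothetical names for another table:
func(CUST_SRC_DF, CUST_TGT_DF, Array("Customer_id"), CUST_SRC_DF.columns)
  .show(false)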
Regarding Ramesh Maharjan's answer: if you consider that operation (row_number()) expensive, try the better option below.
joined_df.withColumn("DIM_ID",
  coalesce(col("DIM_ID"), lit(max_dim_id) + lit(1) + monotonically_increasing_id()))
// monotonically_increasing_id() starts from 0, hence adding lit(1) along with max_dim_id
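One caveat worth knowing: monotonically_increasing_id() guarantees unique, increasing values, but not consecutive ones, because each partition gets its own widely spaced block of IDs. A quick sketch to see the effect (spark is the usual SparkSession):
import org.apache.spark.sql.functions.monotonically_increasing_id

// with 2 partitions the generated IDs come out e.g. as 0, 1 in one partition
// and 8589934592, 8589934593 in the other, so DIM_IDs built this way can
// have large gaps
spark.range(4).repartition(2)
  .withColumn("gen_id", monotonically_increasing_id())
  .show(false)
If gap-free DIM_IDs matter to you, stick with the row_number() version above.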