
Spark scala join dataframe within a dataframe

I have a requirement where I need to join dataframes A and B, calculate a column, and then use that calculated value in a second join between the same two dataframes with a different join condition.

eg:

 DF_Combined = A_DF.join(B_DF, 'Join-Condition', "left_outer").withColumn("col1", value)

After doing the above, I need to join the same two dataframes again, but using the value calculated in the previous join.

 DF_Final = A_DF.join(B_DF, 'New Join Condition', "left_outer").withColumn("col2", DF_Combined.col1 * vol1 * 10)

When I try to do this I get a Cartesian product issue.

You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ...), the resulting dataframe only has the columns of A_DF and B_DF. If you want to use the new column, you need to work with DF_Combined.

From your question I believe you don't need another join at all; you have two options (see the sketch below):
1. Calculate vol1 * 10 as part of the first join.
2. After the join, call DF_Combined.withColumn(...).
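A minimal Scala sketch of both options, assuming the two dataframes join on a column named id and that vol1 comes from B_DF (both hypothetical, since the real schemas and join conditions aren't shown in the question):

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 val spark = SparkSession.builder().appName("join-derived-column").getOrCreate()
 import spark.implicits._

 // Hypothetical schemas - the real ones aren't shown in the question.
 val A_DF = Seq((1, "a"), (2, "b")).toDF("id", "name")
 val B_DF = Seq((1, 100.0), (2, 250.0)).toDF("id", "vol1")

 // Option 1: derive the value as part of the first join's result.
 val DF_Combined = A_DF
   .join(B_DF, Seq("id"), "left_outer")
   .withColumn("col1", col("vol1") * 10)

 // Option 2: join once, then add the derived column to the joined result.
 // The new column is built from DF_Combined's own columns, not from A_DF or B_DF,
 // so no second join (and no Cartesian product) is needed.
 val DF_Final = DF_Combined
   .withColumn("col2", col("col1") * col("vol1") * 10)

 DF_Final.show()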

But please remember: withColumn(name, expr) creates a column named name whose value is the result of expr, so .withColumn(DF_Combined.col1, vol1 * 10) does not make sense - the first argument must be a column name (a String), not a Column.
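For reference, continuing from the sketch above, the correct call shape looks like this (column names are placeholders taken from the question):

 // withColumn(name: String, expr: Column): the first argument is the new column's NAME.
 val ok = DF_Combined.withColumn("col2", col("col1") * col("vol1") * 10)

 // Passing a Column as the first argument, as in the question, won't even compile:
 // DF_Combined.withColumn(DF_Combined("col1"), col("vol1") * 10)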
