
Spark scala join dataframe within a dataframe

I have a requirement where I need to join dataframes A and B, calculate a column, and then use that calculated value in another join between the same two dataframes with a different join condition.

e.g.:

 val DF_Combined = A_DF.join(B_DF, joinCondition, "left_outer").withColumn("col1", value)

After doing the above I need to do the same join again, but use the value calculated in the previous join.

 val DF_Final = A_DF.join(B_DF, newJoinCondition, "left_outer").withColumn("col2", DF_Combined("col1") * col("vol1") * 10)

When I try to do this I get a cartesian product error.

You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ..., the resulting dataframe only has columns from A_DF and B_DF. If you want to have the new column, you need to use DF_Combined.

From your question I believe you don't need another join. You have two possible options:

1. When you do the first join, calculate vol1*10 at that point.
2. After the join, do DF_Combined.withColumn(...).
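The second option can be sketched as follows. This is a minimal sketch, assuming A_DF and B_DF share a join key; the schemas, column names (id, vol1, price) and values are made up for illustration, since the question does not give them:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical data; only the shape of the computation matches the question.
val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

val A_DF = Seq((1, 2.0), (2, 3.0)).toDF("id", "vol1")
val B_DF = Seq((1, 5.0)).toDF("id", "price")

// First join: derive col1 once on the joined result.
val DF_Combined = A_DF.join(B_DF, Seq("id"), "left_outer")
  .withColumn("col1", col("vol1") * col("price"))

// Reuse col1 from DF_Combined itself instead of re-joining A_DF and B_DF.
// Referencing DF_Combined's column inside a fresh A_DF/B_DF join is what
// triggers the cartesian-product error.
val DF_Final = DF_Combined.withColumn("col2", col("col1") * col("vol1") * 10)
```

Because DF_Final is built from DF_Combined, the derived column is already in its lineage and no second join is needed.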

But please remember: withColumn(name, expr) creates a column called name whose value is the result of expr. So .withColumn(DF_Combined.col1, vol1*10) does not make sense.
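To make the signature point concrete, a tiny self-contained sketch (the dataframe and its single col1 column are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("sig").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0).toDF("col1")  // hypothetical one-column frame

// withColumn(colName: String, col: Column): the first argument is the
// *name* of the output column, as a String.
val withCol2 = df.withColumn("col2", col("col1") * 10)

// df.withColumn(df("col1"), col("col1") * 10)  // does not compile:
// the first argument must be a String, not a Column reference
```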

