How to use join with many conditions in pyspark?

Question

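I can use a dataframe join statement with a single `on` condition (in pyspark), but it fails if I try to add multiple conditions.

Code:

   summary2 = summary.join(county_prop, ["category_id", "bucket"], how="leftouter")

The code above works. However, if I add some other condition to the list, something like summary.bucket == 9, it fails. Please help me fix this.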

   The error for the statement 
   summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")

   ERROR : TypeError: 'Column' object is not callable

Edit:

Adding a complete working example.

   from pyspark.sql.types import StructType, StructField, StringType

   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", StringType()), StructField("prop_count", StringType()), StructField("event_count", StringType()), StructField("accum_prop_count", StringType())])
   bucket_summary = sqlContext.createDataFrame([],schema)

   temp_county_prop = sqlContext.createDataFrame([("nation","nation",1,222,444,555),("nation","state",2,222,444,555)],schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation","state",2,121,221,551)],schema)

I want to join on the category_id and bucket columns, and replace the values in bucket_summary with those from county_prop.

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]

   bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")

   1. It works if I write out the whole statement with columns, and it also works if I list the conditions by name, like ["category_id", "bucket"].

   2. But if I use a combination of both, like cond = ["bucket", bucket_summary.category_id == "state"],

it does not work. What could be wrong with statement 2?

Answer 1

For example:

   df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket)==9 should not appear as a join condition.

Update:

In the join condition, you can use either a list of Column join expressions or a list of Column / column_name.

Solution 1
2 2017-08-22 08:53:08