
How to use join with many conditions in pyspark?

I am able to use the dataframe join statement with a single on condition (in pyspark). But if I try to add multiple conditions, it fails.

Code:

   summary2 = summary.join(county_prop, ["category_id", "bucket"], how = "leftouter")

The above code works. However, if I add some other condition to the list, such as summary.bucket == 9, it fails. Please help me fix this issue.

The statement that fails, followed by its error:

   summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")

   ERROR : TypeError: 'Column' object is not callable

Edit:

Adding a full working example.

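   from pyspark.sql.types import StructType, StructField, StringType, IntegerType

   # sqlContext is assumed to be an existing SQLContext (e.g. the one provided by the pyspark shell).
   # The numeric fields are declared as IntegerType so the schema matches the integer values in the rows below.
   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()), StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
   bucket_summary = sqlContext.createDataFrame([], schema)

   temp_county_prop = sqlContext.createDataFrame([("nation","nation",1,222,444,555),("nation","state",2,222,444,555)], schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation","state",2,121,221,551)], schema)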

I want to join on the category_id and bucket columns, so that the values from county_prop replace the matching values in bucket_summary.

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]

   bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")

   1. It works if I write out every condition as a column expression, and it also works if I list column names like ["category_id", "bucket"].

   2. But if I combine both, like cond = ["bucket", bucket_summary.category_id == "state"], it does not work.

What is wrong with the second statement?

For example:

   df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket)==9 should not appear as a join condition.

UPDATE:

In the join condition you can use either a list of Column join expressions or a list of column names; the two forms should not be mixed in one list.
