How to use join with many conditions in pyspark?
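I can use a dataframe join statement with a single on condition (in pyspark), but it fails if I try to add multiple conditions.
Code:
summary2 = summary.join(county_prop, ["category_id", "bucket"], how = "leftouter")
The code above works. However, if I add another condition to the list, something like summary.bucket == 9, it fails. Please help me fix this.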
The error for the statement
summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")
ERROR : TypeError: 'Column' object is not callable
Edit:
Adding a full working example.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", StringType()), StructField("prop_count", StringType()), StructField("event_count", StringType()), StructField("accum_prop_count",StringType())])
bucket_summary = sqlContext.createDataFrame([],schema)
# every field is declared as StringType, so the values are passed as strings
temp_county_prop = sqlContext.createDataFrame([("nation","nation","1","222","444","555"),("nation","state","2","222","444","555")],schema)
bucket_summary = bucket_summary.unionAll(temp_county_prop)
county_prop = sqlContext.createDataFrame([("nation","state","2","121","221","551")],schema)
I want to join on:
the category_id and bucket columns, and I want to replace the values in bucket_summary with those from county_prop.
cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")
1. It works if I spell out every condition as a column expression, and it also works if I list column names like ["category_id", "bucket"].
2. But if I combine the two forms, as in cond = ["bucket", bucket_summary.category_id == "state"], it does not work. What could be wrong with the second statement?
For example:
df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')
But in your case, (summary.bucket) == 9 should not appear as a join condition.
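One possible reading of this advice is to apply the row filter before the join and keep the join itself on plain column names. The sketch below assumes that reading; `filter_then_join` is a hypothetical helper name, not part of PySpark, and it relies only on the `DataFrame.where` and `DataFrame.join` methods.

```python
def filter_then_join(left, right, keys, row_filter, how="left_outer"):
    """Filter `left` first, then join it to `right` on plain column names.

    `left` and `right` are expected to be PySpark DataFrames, `keys` a list
    of column names, and `row_filter` a Column expression or SQL string.
    """
    filtered = left.where(row_filter)              # e.g. summary.bucket == 9
    return filtered.join(right, on=keys, how=how)  # join on names only


# Hypothetical usage with the DataFrames from the question:
# summary2 = filter_then_join(summary, county_prop,
#                             ["category_id", "bucket"],
#                             summary.bucket == 9)
```

Note that this is not equivalent to putting the condition inside a left outer join: filtering first drops the non-matching rows of the left DataFrame, whereas a condition inside the join keeps them with nulls on the right side.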
Update:
In the join condition you can use either a list of Column join expressions or a list of Column / column_name values.
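Since the two forms should not be mixed in one list, a mixed list can be rewritten so that it holds Column expressions only. The following is a minimal sketch under that assumption; `normalize_join_on` is a hypothetical helper, not a PySpark function, and it turns each column name into an equality between the two DataFrames.

```python
def normalize_join_on(left, right, on):
    """Return a list of join conditions that contains expressions only.

    A string such as "bucket" becomes left["bucket"] == right["bucket"];
    anything else is assumed to be a Column expression and is kept as-is.
    """
    conditions = []
    for item in on:
        if isinstance(item, str):
            conditions.append(left[item] == right[item])
        else:
            conditions.append(item)
    return conditions


# Hypothetical usage with the DataFrames from the question:
# cond = normalize_join_on(bucket_summary, county_prop,
#                          ["bucket", bucket_summary.category_id == "state"])
# bucket_summary2 = bucket_summary.join(county_prop, cond, how="left_outer")
```

One difference to keep in mind: joining on a list of column names yields a single copy of each key column, while joining on expressions keeps the key columns from both DataFrames, so a select or drop may be needed afterwards.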