
pyspark join with 2 lookup tables

I have sales data, and the product details are stored in two lookup tables.

df_prod_lookup1

ID     product     description
1      cereal      Minipack
2      canola      bottle
4      rice        bag

df_prod_lookup2

ID     product     description
6      glass       bottle
8      plants      hibiscus
10     tree        banyan

sales_df

ID     product     
10     tree        
1      cereal      
4      rice        
8      plants 

Expected output:

ID     product     description
10     tree        banyan
1      cereal      Minipack
4      rice        bag
8      plants      hibiscus

I am supposed to use lookup table 1 first, and fall back to lookup table 2 only if the ID is not available in lookup table 1.

Lookup tables 1 and 2 have different column names and cannot be merged into one. Is there a way, for every record in the sales data, to check whether the ID is available in lookup table 1 and join on it, and otherwise join on lookup table 2? Thanks.

So far I could only do a simple join with one lookup table:

df_final = sales_df.join(df_prod_lookup1, on=['ID'], how='left')

Regards

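Left join first with lookup table 1, and then with lookup table 2. The lookup columns are renamed beforehand so that the two tables do not clash after the joins.
The coalesce function then merges the description fields: it returns the first non-null value among its arguments, so the value from lookup table 1 is used when present and the value from lookup table 2 otherwise.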

from pyspark.sql.functions import coalesce

# Rename the lookup columns so they stay distinguishable after the joins
df_prod_lookup1 = df_prod_lookup1.withColumnRenamed("product", "product1").withColumnRenamed("description", "description1")
df_prod_lookup2 = df_prod_lookup2.withColumnRenamed("product", "product2").withColumnRenamed("description", "description2")

# Left join both lookups on ID, then keep the first non-null value of each pair
sales_df.join(df_prod_lookup1, on=['ID'], how='left')\
        .join(df_prod_lookup2, on=['ID'], how='left')\
        .withColumn('product', coalesce('product1', 'product2'))\
        .withColumn('description', coalesce('description1', 'description2'))\
        .drop('product1', 'product2', 'description1', 'description2').show()

+---+-------+-----------+
| ID|product|description|
+---+-------+-----------+
|  8| plants|   hibiscus|
|  1| cereal|   Minipack|
| 10|   tree|     banyan|
|  4|   rice|        bag|
+---+-------+-----------+
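
To make the fallback behaviour concrete, here is a plain-Python sketch (no Spark involved) of what the two left joins plus coalesce compute for each sales row. It is only an illustration of the semantics, not the PySpark code itself: the dictionaries, the local coalesce helper and resolve_descriptions are hypothetical names made up for this example, and unlike the answer above it keeps the product value from the sales rows instead of taking it from the lookups.

```python
# Illustration only: plain Python stand-ins for the Spark DataFrames.
# All names here (lookup1, lookup2, resolve_descriptions) are hypothetical.

def coalesce(*values):
    """Return the first argument that is not None, like SQL/Spark coalesce."""
    for value in values:
        if value is not None:
            return value
    return None

# ID -> description, taken from the sample tables in the question
lookup1 = {1: "Minipack", 2: "bottle", 4: "bag"}
lookup2 = {6: "bottle", 8: "hibiscus", 10: "banyan"}

# (ID, product) rows from sales_df
sales = [(10, "tree"), (1, "cereal"), (4, "rice"), (8, "plants")]

def resolve_descriptions(sales_rows, first, second):
    """Look each ID up in `first`, falling back to `second` on a miss."""
    resolved_rows = []
    for row_id, product in sales_rows:
        # dict.get returns None on a miss, mirroring the null a left join leaves
        description = coalesce(first.get(row_id), second.get(row_id))
        resolved_rows.append((row_id, product, description))
    return resolved_rows

resolved = resolve_descriptions(sales, lookup1, lookup2)
for row in resolved:
    print(row)
```

If an ID were present in both lookup tables, the value from lookup table 1 would win, because coalesce takes its arguments in order. An ID missing from both tables ends up with None here, which corresponds to a null description in the Spark result.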
