I have sales data and product details in two lookup tables:
df_prod_lookup1
ID  product  description
1   cereal   Minipack
2   canola   bottle
4   rice     bag

df_prod_lookup2
ID  product  description
6   glass    bottle
8   plants   hibiscus
10  tree     banyan

sales_df
ID  product
10  tree
1   cereal
4   rice
8   plants

Expected output:
ID  product  description
10  tree     banyan
1   cereal   Minipack
4   rice     bag
8   plants   hibiscus
I am supposed to use lookup table 1 first, and fall back to lookup table 2 if the ID is not available in lookup table 1.
Lookup tables 1 and 2 have different column names and cannot be merged into one. Is there a way to check whether the ID is available in lookup table 1 and do the join, and otherwise use lookup table 2, for every record in the sales data? Thanks.
I could only do a simple join with one lookup table:
df_final = sales_df.join(df_prod_lookup1, on=['ID'], how='left')
Regards
Left join first with lookup table 1, and then with lookup table 2. The coalesce function lets you merge the product and description fields by taking the first non-null value.
from pyspark.sql.functions import coalesce

# Rename the lookup columns so the two left joins don't produce duplicate column names
df_prod_lookup1 = df_prod_lookup1.withColumnRenamed("product", "product1").withColumnRenamed("description", "description1")
df_prod_lookup2 = df_prod_lookup2.withColumnRenamed("product", "product2").withColumnRenamed("description", "description2")

# Join against both lookup tables, then keep the first non-null match per field
sales_df.join(df_prod_lookup1, on=['ID'], how='left')\
    .join(df_prod_lookup2, on=['ID'], how='left')\
    .withColumn('product', coalesce('product1', 'product2'))\
    .withColumn('description', coalesce('description1', 'description2'))\
    .drop('product1', 'product2', 'description1', 'description2').show()
+---+-------+-----------+
| ID|product|description|
+---+-------+-----------+
| 8| plants| hibiscus|
| 1| cereal| Minipack|
| 10| tree| banyan|
| 4| rice| bag|
+---+-------+-----------+
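The coalesce call is what implements the fallback: for each row it keeps the first non-null value among its arguments, so a match in lookup table 1 always wins over lookup table 2. The same lookup-then-fall-back logic can be sketched in plain Python with dictionaries (no Spark needed; the data below mirrors the sample tables above):

# Plain-Python sketch of the join + coalesce fallback, using the sample data above
lookup1 = {1: ("cereal", "Minipack"), 2: ("canola", "bottle"), 4: ("rice", "bag")}
lookup2 = {6: ("glass", "bottle"), 8: ("plants", "hibiscus"), 10: ("tree", "banyan")}
sales_ids = [10, 1, 4, 8]

def resolve(pid):
    # coalesce-style fallback: take lookup1's match if present, otherwise lookup2's
    match = lookup1.get(pid) or lookup2.get(pid)
    product, description = match if match else (None, None)
    return (pid, product, description)

result = [resolve(pid) for pid in sales_ids]
# result == [(10, 'tree', 'banyan'), (1, 'cereal', 'Minipack'),
#            (4, 'rice', 'bag'), (8, 'plants', 'hibiscus')]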