Hi, I know this is a basic question, but I'm new to Foundry and PySpark, please help! I need to join two datasets in a Code Workbook in Palantir Foundry using 3 columns (two have the same name in both datasets, but the third is named differently in each). I'm not sure how to do this. Thank you for your help!
According to the PySpark documentation, you can pass a list of column names as the "on" argument (the join keys). If you were joining two datasets (df1 & df2), where df1 had join keys ["a", "b", "c"] and df2 had join keys ["a", "b", "c2"], I would rename the mismatched column first and then join, something like this:
df1.join(df2.withColumnRenamed("c2", "c"), on=["a", "b", "c"], how="left")
As per the PySpark documentation that @kate provided, the "on" argument can also be a join expression rather than a list of column names. For example, you can specify that the date column in table A is in between date_before and date_after in table B. This would look something like df_a.join(df_b, on=((df_a.date < df_b.date_after) & (df_a.date > df_b.date_before))), so you have a lot of flexibility here in terms of how you can join datasets.