I would like to join two PySpark DataFrames on some conditions and also add a new column.
df1 = spark.createDataFrame(
[(2010, 1, 'rdc', 'bdvs'), (2010, 1, 'rdc','yybp'),
(2007, 6, 'utw', 'itcs'), (2007, 6, 'utw','tbsw')
],
("year", "month", "u_id", "p_id"))
df2 = spark.createDataFrame(
[(2010, 1, 'rdc', 'bdvs'),
(2007, 6, 'utw', 'itcs')
],
("year", "month", "u_id", "p_id"))
df1
year month u_id p_id
2010 1 rdc bdvs
2010 1 rdc yybp
2007 6 utw itcs
2007 6 utw tbsw
df2
year month u_id p_id
2010 1 rdc bdvs
2007 6 utw itcs
The new DataFrame that I need:
year month u_id p_id is_true
2010 1 rdc bdvs 1
2010 1 rdc yybp 0
2007 6 utw itcs 1
2007 6 utw tbsw 0
My python3 code:
import pyspark.sql.functions as F
t =df1.join(df2, (df1.year==df2.year) & (df1.month==df2.month) & (df1.u_id==df2.u_id), how='left').withColumn('is_true', F.when(df1.p_id==df2.p_id, F.lit(1)).otherWise(F.lit(0)))
I got this error:
TypeError: 'Column' object is not callable
I tried several solutions, but none of them worked.
Am I missing something? I am trying to add a constant as the new column's value based on some conditions.
Thanks.
Change otherWise to otherwise (the method name is all lowercase).
Example:
t =df1.alias("df1").join(df2.alias("df2"), (df1.year==df2.year) & (df1.month==df2.month) & (df1.u_id==df2.u_id), how='left').\
withColumn('is_true', F.when(df1.p_id == df2.p_id, F.lit(1)).otherwise(F.lit(0))).select("df1.*","is_true")
t.show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2007|    6| utw|itcs|      1|
#|2007|    6| utw|tbsw|      0|
#|2010|    1| rdc|bdvs|      1|
#|2010|    1| rdc|yybp|      0|
#+----+-----+----+----+-------+
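As for why the misspelling produces a TypeError rather than an AttributeError: as far as I know, a PySpark Column treats an unknown attribute name as a nested-field lookup and returns another Column, so otherWise is not rejected until the code tries to call it. The toy class below imitates that behavior in plain Python. ToyColumn and its string-based expr are hypothetical stand-ins for illustration, not PySpark's actual implementation.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark column expression (not PySpark code)."""

    def __init__(self, expr):
        self.expr = expr

    def otherwise(self, value):
        # The real, correctly spelled method: completes the CASE expression.
        return ToyColumn(f"{self.expr} ELSE {value} END")

    def __getattr__(self, name):
        # Runs only when normal attribute lookup fails. Unknown names are
        # treated as field access, so a new column object is returned
        # instead of raising AttributeError.
        if name.startswith("__"):
            raise AttributeError(name)
        return ToyColumn(f"{self.expr}.{name}")


cond = ToyColumn("CASE WHEN a = b THEN 1")

ok = cond.otherwise(0)      # correct spelling: a bound method is called

try:
    cond.otherWise(0)       # misspelled: returns a ToyColumn, which is then called
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok.expr)
print(error_message)
```

The same mechanism means other misspelled Column methods tend to fail with "object is not callable", so that message is a hint to check method names for typos.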
Another way, without using a when statement, would be to use left_semi and left_anti joins.
from pyspark.sql.functions import lit
columns = df1.columns
# Rows of df1 with no match in df2 (left_anti) get 0; rows with a match (left_semi) get 1.
df1.\
join(df2, columns, 'left_anti').\
withColumn("is_true", lit(0)).\
union(df1.\
join(df2, columns, 'left_semi').\
withColumn("is_true", lit(1))).\
show()
# Row order may vary.
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2010|    1| rdc|yybp|      0|
#|2007|    6| utw|tbsw|      0|
#|2007|    6| utw|itcs|      1|
#|2010|    1| rdc|bdvs|      1|
#+----+-----+----+----+-------+