PySpark DataFrame left join and add a new column with a constant value

I would like to join two PySpark DataFrames on some conditions and also add a new column.

df1 = spark.createDataFrame(
     [(2010, 1, 'rdc', 'bdvs'), (2010, 1, 'rdc','yybp'),
      (2007, 6, 'utw', 'itcs'), (2007, 6, 'utw','tbsw')
     ], 
     ("year", "month", "u_id", "p_id"))

df2 = spark.createDataFrame(
     [(2010, 1, 'rdc', 'bdvs'),
      (2007, 6, 'utw', 'itcs')
     ], 
     ("year", "month", "u_id", "p_id"))

df1

 year month u_id p_id
 2010 1     rdc  bdvs
 2010 1     rdc  yybp
 2007 6     utw  itcs
 2007 6     utw  tbsw

df2

 year month u_id p_id
 2010 1     rdc  bdvs
 2007 6     utw  itcs
 

The new DataFrame that I need:

 year month u_id p_id  is_true
 2010 1     rdc  bdvs     1
 2010 1     rdc  yybp     0
 2007 6     utw  itcs     1
 2007 6     utw  tbsw     0

My Python 3 code:

 import pyspark.sql.functions as F
 t =df1.join(df2, (df1.year==df2.year) & (df1.month==df2.month) & (df1.u_id==df2.u_id), how='left').withColumn('is_true', F.when(df1.p_id==df2.p_id, F.lit(1)).otherWise(F.lit(0)))

I got this error:

 TypeError: 'Column' object is not callable

I tried some solutions, but none of them worked.

Am I missing something? I am trying to add a constant as the new column's value based on some conditions.

Thanks

Change otherWise to otherwise (the method name is all lowercase).

Example:

t = (df1.alias("df1")
     .join(df2.alias("df2"), (df1.year == df2.year) & (df1.month == df2.month) & (df1.u_id == df2.u_id), how='left')
     .withColumn('is_true', F.when(df1.p_id == df2.p_id, F.lit(1)).otherwise(F.lit(0)))
     .select("df1.*", "is_true"))

t.show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2007|    6| utw|itcs|      1|
#|2007|    6| utw|tbsw|      0|
#|2010|    1| rdc|bdvs|      1|
#|2010|    1| rdc|yybp|      0|
#+----+-----+----+----+-------+
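
On why the typo shows up as "'Column' object is not callable" rather than an AttributeError: as far as I can tell, PySpark's Column treats an unknown attribute name as a field lookup and returns another Column, so the failure only happens when that result is called. The snippet below is a plain-Python sketch of that mechanism. ToyColumn is a hypothetical stand-in written for illustration, not PySpark's actual class.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark Column (illustration only)."""

    def __init__(self, name):
        self.name = name

    def otherwise(self, value):
        # The correctly spelled method exists and can be called.
        return f"CASE ... ELSE {value} END"

    def __getattr__(self, item):
        # Runs only when normal attribute lookup fails, for example for the
        # misspelled "otherWise". Instead of raising, it returns another
        # column-like object, the way a struct field lookup would.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn(f"{self.name}.{item}")


col = ToyColumn("when_expr")
field = col.otherWise  # no error yet: this is just another ToyColumn

try:
    field(0)  # the object returned above is not callable
except TypeError as exc:
    message = str(exc)

print(message)
```

So when this error appears on a chained Column call, the first thing to check is the spelling and capitalisation of the method name.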

Another way, without using a when expression, is to combine left_anti and left_semi joins on all columns. left_anti keeps the df1 rows that have no match in df2 (is_true = 0); left_semi keeps the rows that do (is_true = 1).

from pyspark.sql.functions import lit

columns = df1.columns

# df1 rows with no match in df2 get 0, rows with a match get 1
df1.\
join(df2, columns, 'left_anti').\
withColumn("is_true", lit(0)).\
union(df1.\
join(df2, columns, 'left_semi').\
withColumn("is_true", lit(1))).\
show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2010|    1| rdc|yybp|      0|
#|2007|    6| utw|tbsw|      0|
#|2007|    6| utw|itcs|      1|
#|2010|    1| rdc|bdvs|      1|
#+----+-----+----+----+-------+
