
pyspark dataframe left join and add a new column with constant value

I would like to join two PySpark DataFrames with conditions and also add a new column.

df1 = spark.createDataFrame(
     [(2010, 1, 'rdc', 'bdvs'), (2010, 1, 'rdc','yybp'),
      (2007, 6, 'utw', 'itcs'), (2007, 6, 'utw','tbsw')
     ], 
     ("year", "month", "u_id", "p_id"))

df2 = spark.createDataFrame(
     [(2010, 1, 'rdc', 'bdvs'),
      (2007, 6, 'utw', 'itcs')
     ], 
     ("year", "month", "u_id", "p_id"))

df1:

 year month u_id p_id
 2010 1     rdc  bdvs
 2010 1     rdc  yybp
 2007 6     utw  itcs
 2007 6     utw  tbsw

df2:

 year month u_id p_id
 2010 1     rdc  bdvs
 2007 6     utw  itcs
 

The new df that I need:

 year month u_id p_id  is_true
 2010 1     rdc  bdvs     1
 2010 1     rdc  yybp     0
 2007 6     utw  itcs     1
 2007 6     utw  tbsw     0

My Python 3 code:

 import pyspark.sql.functions as F
 t = df1.join(df2, (df1.year == df2.year) & (df1.month == df2.month) & (df1.u_id == df2.u_id), how='left')\
        .withColumn('is_true', F.when(df1.p_id == df2.p_id, F.lit(1)).otherWise(F.lit(0)))

I got this error:

 TypeError: 'Column' object is not callable

I tried some solutions, but none of them worked.

Am I missing something? I'm trying to add a constant as a new column value based on some conditions.

Thanks.

Change otherWise to otherwise. F.when(...) returns a Column, and Column has no otherWise method; attribute access on a Column falls back to struct-field lookup and returns another Column, so calling it raises TypeError: 'Column' object is not callable.

Example:

t = df1.alias("df1").join(df2.alias("df2"), (df1.year == df2.year) & (df1.month == df2.month) & (df1.u_id == df2.u_id), how='left').\
withColumn('is_true', F.when(df1.p_id == df2.p_id, F.lit(1)).otherwise(F.lit(0))).select("df1.*", "is_true")

t.show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2007|    6| utw|itcs|      1|
#|2007|    6| utw|tbsw|      0|
#|2010|    1| rdc|bdvs|      1|
#|2010|    1| rdc|yybp|      0|
#+----+-----+----+----+-------+
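
A variation on the same idea (a sketch, not part of the original answer; t2 is just an illustrative name): include p_id in the join condition as well, then derive the flag from whether the right side matched at all, so no when/otherwise expression is needed.

# left join on all four keys; unmatched df1 rows get nulls on df2's side
t2 = df1.alias("a").join(df2.alias("b"),
     (df1.year == df2.year) & (df1.month == df2.month) &
     (df1.u_id == df2.u_id) & (df1.p_id == df2.p_id), how='left').\
withColumn("is_true", F.col("b.p_id").isNotNull().cast("int")).\
select("a.*", "is_true")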

Another way, without using a when statement, is to use left_semi and left_anti joins.

from pyspark.sql.functions import lit

columns=df1.columns

df1.\
join(df2,columns,'left_semi').\
withColumn("is_true",lit(1)).\
unionAll(df1.\
join(df2,columns,'left_anti').\
withColumn("is_true",lit(0))).\
show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2010|    1| rdc|bdvs|      1|
#|2007|    6| utw|itcs|      1|
#|2010|    1| rdc|yybp|      0|
#|2007|    6| utw|tbsw|      0|
#+----+-----+----+----+-------+
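
Note that unionAll (an alias of union since Spark 2.0) concatenates the two results without any ordering guarantee, so rows may come back in a different order than shown above. If you need a deterministic order, add an explicit sort (a usage sketch with the same columns list as above):

# sort the unioned result for a stable display
df1.join(df2,columns,'left_semi').withColumn("is_true",lit(1)).\
unionAll(df1.join(df2,columns,'left_anti').withColumn("is_true",lit(0))).\
orderBy("year","month","u_id","p_id").\
show()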
