PySpark DataFrame left join and add a new column with a constant value
I would like to join two PySpark DataFrames with conditions and also add a new column.
df1 = spark.createDataFrame(
[(2010, 1, 'rdc', 'bdvs'), (2010, 1, 'rdc','yybp'),
(2007, 6, 'utw', 'itcs'), (2007, 6, 'utw','tbsw')
],
("year", "month", "u_id", "p_id"))
df2 = spark.createDataFrame(
[(2010, 1, 'rdc', 'bdvs'),
(2007, 6, 'utw', 'itcs')
],
("year", "month", "u_id", "p_id"))
df1
year  month  u_id  p_id
2010  1      rdc   bdvs
2010  1      rdc   yybp
2007  6      utw   itcs
2007  6      utw   tbsw
df2
year  month  u_id  p_id
2010  1      rdc   bdvs
2007  6      utw   itcs
The new DataFrame that I need:
year  month  u_id  p_id  is_true
2010  1      rdc   bdvs  1
2010  1      rdc   yybp  0
2007  6      utw   itcs  1
2007  6      utw   tbsw  0
My Python 3 code:
import pyspark.sql.functions as F
t =df1.join(df2, (df1.year==df2.year) & (df1.month==df2.month) & (df1.u_id==df2.u_id), how='left').withColumn('is_true', F.when(df1.p_id==df2.p_id, F.lit(1)).otherWise(F.lit(0)))
I got this error:
TypeError: 'Column' object is not callable
I tried some solutions, but none of them worked.
Am I missing something? I am trying to add a constant as a new column value based on some conditions.
Thanks.
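Change otherWise to otherwise. Python method names are case-sensitive, and PySpark resolves an unknown attribute on a Column as a nested field lookup, so otherWise returns another Column instead of raising AttributeError. Calling that Column is what produces TypeError: 'Column' object is not callable.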
Example:
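import pyspark.sql.functions as F

# Left join on year, month and u_id, then flag rows whose p_id also matches.
t = (
    df1.alias("df1")
    .join(
        df2.alias("df2"),
        (df1.year == df2.year) & (df1.month == df2.month) & (df1.u_id == df2.u_id),
        how="left",
    )
    .withColumn("is_true", F.when(df1.p_id == df2.p_id, F.lit(1)).otherwise(F.lit(0)))
    .select("df1.*", "is_true")  # keep only df1's columns plus the new flag
)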
t.show()
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2007|    6| utw|itcs|      1|
#|2007|    6| utw|tbsw|      0|
#|2010|    1| rdc|bdvs|      1|
#|2010|    1| rdc|yybp|      0|
#+----+-----+----+----+-------+
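To see why the misspelling surfaces as a TypeError rather than an AttributeError, here is a minimal pure-Python sketch. FakeColumn is a hypothetical stand-in written for illustration, not PySpark's actual Column class; it only imitates the behavior of treating unknown attribute names as nested field lookups.

```python
class FakeColumn:
    """Simplified, hypothetical stand-in for a PySpark Column."""

    def __init__(self, name):
        self.name = name

    def otherwise(self, value):
        # The correctly spelled method.
        return FakeColumn(f"{self.name} ELSE {value}")

    def __getattr__(self, item):
        # Invoked only when normal attribute lookup fails: the unknown name
        # is treated as a nested field, so another column object is returned.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


col = FakeColumn("CASE WHEN cond THEN 1")

ok = col.otherwise(0)   # normal method call
field = col.otherWise   # no AttributeError: this is just another FakeColumn

try:
    col.otherWise(0)    # calling a FakeColumn instance fails here
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)    # 'FakeColumn' object is not callable
```

The same pattern means any misspelled Column method tends to show up as "not callable", so that message is a good hint to check method spelling and capitalization first.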
Another way, without using a when statement, is to combine left_semi and left_anti joins.
from pyspark.sql.functions import lit

columns = df1.columns
# left_anti keeps the df1 rows with no match in df2 (flag 0);
# left_semi keeps the df1 rows that do have a match in df2 (flag 1).
df1.\
join(df2, columns, 'left_anti').\
withColumn("is_true", lit(0)).\
union(df1.\
join(df2, columns, 'left_semi').\
withColumn("is_true", lit(1))).\
show()
# Row order may vary between runs.
#+----+-----+----+----+-------+
#|year|month|u_id|p_id|is_true|
#+----+-----+----+----+-------+
#|2010|    1| rdc|yybp|      0|
#|2007|    6| utw|tbsw|      0|
#|2007|    6| utw|itcs|      1|
#|2010|    1| rdc|bdvs|      1|
#+----+-----+----+----+-------+