
How to Update Dataframe column in pyspark with conditions?

I am trying to update one DataFrame column based on conditions from another column. I have two columns in my DataFrame, DATE_JOINING and BONUS.

rdd = spark.sparkContext.parallelize([('23-08-2021', ''), ('12-11-2009', ''), ('24-09-2013', '')])

df = spark.createDataFrame(rdd, schema=['DATE_JOINING', 'BONUS'])

Basically, in SQL we can write the query as:

UPDATE EMPLOYEE.SALARY
SET BONUS= 'GIVE BONUS'
where DATE_JOINING < 09-01-2015

In Pyspark I am trying the below code:

df=df.withColumn('SALARY',when(df.DATE_JOINING <'01-09-2018',"GIVE BONUS").otherwise(''))

But it is also giving "GIVE BONUS" for rows with dates after 2018, which I don't want, and it is not giving any result for rows with dates before 2018. How can I correct this code?

This is because your date column is not actually of date type, but string type.
To perform operations on dates, you need to convert the column with to_date, giving the format of the input strings explicitly; the converted column can then be compared against literals in yyyy-MM-dd format.

import pyspark.sql.functions as F

df = (df
 # parse the string column into a real date; MM is month (lowercase mm would be minute-of-hour)
 .withColumn('DATE_JOINING', F.to_date('DATE_JOINING', format='dd-MM-yyyy'))
 # date columns can be compared directly against yyyy-MM-dd literals
 .withColumn('SALARY', F.when(F.col('DATE_JOINING') < '2018-09-01', 'GIVE BONUS').otherwise(''))
)

df.show()
+------------+-----+----------+
|DATE_JOINING|BONUS|    SALARY|
+------------+-----+----------+
|  2021-08-23|     |          |
|  2009-11-12|     |GIVE BONUS|
|  2013-09-24|     |GIVE BONUS|
+------------+-----+----------+
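
For reference, a minimal variant (a sketch assuming the same SparkSession and column names; the cutoff variable name is my own) that compares against an explicit date literal rather than a bare string, which makes the cutoff unambiguous and does not rely on string ordering:

import pyspark.sql.functions as F

# build the cutoff as a real date literal instead of a plain string
cutoff = F.to_date(F.lit('2018-09-01'), 'yyyy-MM-dd')

df = (df
      .withColumn('DATE_JOINING', F.to_date('DATE_JOINING', 'dd-MM-yyyy'))
      .withColumn('SALARY',
                  F.when(F.col('DATE_JOINING') < cutoff, 'GIVE BONUS')
                   .otherwise('')))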
