PySpark - withColumn + when with a variable gives "Method or([class java.lang.Boolean]) does not exist"

I need to add a column to a DataFrame based on one of the other columns AND the value of a variable (represented here as otherThing); see below:

from pyspark.sql.functions import when

otherThing = "test"
dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                                ("id", "code", "amt"))
# this works fine
dataDF.withColumn("new_column", when((dataDF["id"] <= 70), "A").otherwise("B")).display()
# this raises the error
dataDF.withColumn("new_column", when((dataDF["id"] <= 70) | (otherThing == ""), "A").otherwise("B")).display()

This returns the following error: Method or([class java.lang.Boolean]) does not exist. In the example otherThing is constant, but in the real scenario it can have different values.

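The issue is that the variable comparison is not wrapped in lit. Python evaluates otherThing == "" first and produces a plain bool rather than a Column, so the | operator on the Column passes a java.lang.Boolean to the JVM, where Column has no or method that accepts it. Wrapping the comparison in lit turns it into a literal Column.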

https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.lit.html
https://sparkbyexamples.com/pyspark/pyspark-lit-add-literal-constant/

Working code:

import pyspark.sql.functions as F

otherThing = ""
dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                                ("id", "code", "amt"))
# F.lit turns the Python bool into a literal Column, so | combines two Columns
dataDF.withColumn("new_column", F.when((dataDF["id"] <= 70) | F.lit(otherThing == ""), "A").otherwise("B")).display()
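
As a rough illustration of why lit makes the difference, the plain-Python sketch below models the failure without needing Spark. ToyJvmColumn, ToyColumn and toy_lit are hypothetical stand-ins, not PySpark classes. The sketch assumes that PySpark's | unwraps Column operands and hands anything else to the JVM side unchanged, where no matching or method is found.

```python
class ToyJvmColumn:
    """Hypothetical stand-in for the JVM-side column; or_ only accepts another ToyJvmColumn."""

    def __init__(self, expr):
        self.expr = expr

    def or_(self, other):
        if not isinstance(other, ToyJvmColumn):
            # Mimics the "no method with this signature" lookup failure.
            raise TypeError(f"Method or([class {type(other).__name__}]) does not exist")
        return ToyJvmColumn(f"({self.expr} OR {other.expr})")


class ToyColumn:
    """Hypothetical Python-side wrapper; | forwards its operand to the JVM-side object."""

    def __init__(self, jc):
        self._jc = jc

    def __or__(self, other):
        # Columns are unwrapped; any other value is passed through as-is (assumption).
        operand = other._jc if isinstance(other, ToyColumn) else other
        return ToyColumn(self._jc.or_(operand))


def toy_lit(value):
    """Hypothetical analogue of lit: wrap a plain Python value in a column."""
    return ToyColumn(ToyJvmColumn(repr(value)))


other_thing = "test"
id_check = ToyColumn(ToyJvmColumn("id <= 70"))

# Python evaluates the comparison first, so flag is a plain bool, not a column.
flag = (other_thing == "")

try:
    id_check | flag
    error = None
except TypeError as exc:
    error = str(exc)

# Wrapping the bool makes both operands columns, so the "or" lookup succeeds.
fixed = id_check | toy_lit(flag)

print(error)           # Method or([class bool]) does not exist
print(fixed._jc.expr)  # (id <= 70 OR False)
```

The same reasoning applies to & and to any other Column operator whose right-hand side is computed purely from Python variables: wrap that side in lit before combining it.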
