
PySpark: assign names to a column's values with withColumn

I am new to PySpark but have some experience with R.

Question: I want to assign a name to each height value (a number) listed in one column. I started with the code below:

w = Window.partitionBy("student_id")
df_enc_hw = df_enc_hw.withColumn("stuname", \
                       when(lower(col("height")) <= 4, "under_ht") 
                      .when(lower(col("height")) > 4 < 5, "ok_ht")  
                      .when(lower(col("height")) >=5 < 6, "normal_ht")  
                      .when(lower(col("height")) >=6, "abnor_ht")) 

But I get the following error:

    633 
    634     def __nonzero__(self):
--> 635         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    636                          "'~' for 'not' when building DataFrame boolean expressions.")
    637     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Thank you for your help K

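Python expands a chained comparison such as x > 4 < 5 into (x > 4) and (4 < 5), and 'and' tries to convert the Column returned by x > 4 into a bool, which raises this error. Split each range check into separate comparisons, wrap each one in parentheses, and join them with '&', like this: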

from pyspark.sql.functions import col, lower, when

# Each range check is two comparisons, each in parentheses, joined with &.
df_enc_hw = df_enc_hw.withColumn("stuname",
                       when(lower(col("height")) <= 4, "under_ht")
                      .when((lower(col("height")) > 4) & (lower(col("height")) < 5), "ok_ht")
                      .when((lower(col("height")) >= 5) & (lower(col("height")) < 6), "normal_ht")
                      .when(lower(col("height")) >= 6, "abnor_ht"))
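To see why the chained form fails, here is a minimal pure-Python sketch that runs without Spark. FakeColumn is a hypothetical stand-in, not part of PySpark. It only assumes what the traceback above shows: comparing a column yields an expression object rather than True or False, and that object refuses to be converted to a bool.

```python
class FakeColumn:
    """Hypothetical stand-in for a PySpark Column, for illustration only."""

    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        # A comparison builds a new expression instead of returning a bool.
        return FakeColumn(f"({self.expr} > {other})")

    def __lt__(self, other):
        return FakeColumn(f"({self.expr} < {other})")

    def __and__(self, other):
        # The & operator combines two expressions into one.
        return FakeColumn(f"({self.expr} AND {other.expr})")

    def __bool__(self):
        # Like the real Column in the traceback, refuse to become a bool.
        raise ValueError("Cannot convert column into bool")


height = FakeColumn("height")

# Chained form: Python evaluates (height > 4) and (4 < 5),
# so 'and' calls bool() on the result of height > 4.
try:
    _ = height > 4 < 5
    chained_error = None
except ValueError as exc:
    chained_error = str(exc)

# Explicit form: two parenthesized comparisons joined with &.
combined = (height > 4) & (height < 5)

print(chained_error)
print(combined.expr)
```

The parentheses around each comparison matter: in Python, & binds more tightly than > and <, so without them the expression is grouped differently from what was intended.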


 