
How to modify a cell's value based on a condition in a PySpark dataframe

I have a dataframe with a few columns, like below:

category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
---------|-------------|--------|------------|-------------|------------------|------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677

This dataframe starts with 0 rows, and each function in my script adds a row to it.

One of the functions needs to modify one or two cell values based on a condition. How can I do this?

Code:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("category", StringType()),
    StructField("category_id", StringType()),
    StructField("bucket", StringType()),
    StructField("prop_count", StringType()),
    StructField("event_count", StringType()),
    StructField("accum_prop_count", StringType()),
])
a_df = sqlContext.createDataFrame([], schema)

a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)

A row added by another function:

a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)

Now, to modify a row, I am trying a join with a condition.

a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")

But this code is not working as intended. Instead of a row being updated, I get every column duplicated:

+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+

How can I fix this? The correct output should have 2 rows, with the second row holding the updated values.

1) An inner join drops rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.

2) An == join condition duplicates columns. Since your columns have the same names on both sides, you can pass a list of column names instead.

3) I assume "some other condition" refers to bucket.

4) You want to keep the values from a_temp4 when they exist (the left join sets them to null when there is no match); psf.coalesce lets you do exactly that.

import pyspark.sql.functions as psf
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
    psf.coalesce(a_temp4.category, a_df.category).alias("category"), 
    "category_id", 
    "bucket", 
    psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"), 
    psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"), 
    psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
    )

+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
|   state|      state|     2|       444|        555|             666|
|  nation|     nation|     1|       222|        444|             555|
+--------+-----------+------+----------+-----------+----------------+

If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:

def update_col(category_id, bucket, col_name, col_val):
    return psf.when(
        (a_df.category_id == category_id) & (a_df.bucket == bucket), col_val
    ).otherwise(a_df[col_name]).alias(col_name)

a_df.select(
    update_col("state", 2, "category", "state"), 
    "category_id", 
    "bucket", 
    update_col("state", 2, "prop_count", 444), 
    update_col("state", 2, "event_count", 555), 
    update_col("state", 2, "accum_prop_count", 666)
).show()
