How to modify a cell's value based on a condition in a PySpark dataframe

I have a dataframe which has a few columns like below:

category| category_id|    bucket| prop_count| event_count |   accum_prop_count |  accum_event_count
-----------------------------------------------------------------------------------------------------
nation  |   nation     |    1     | 222       |     444     |   555              |  6677

This dataframe starts with 0 rows, and each function in my script adds a row to it.

There is a function which needs to modify one or two cell values based on a condition. How can I do this?

Code:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Numeric columns are typed as integers so the rows below match the schema.
schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()),
                     StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()),
                     StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
a_df = sqlContext.createDataFrame([], schema)

a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)

Rows added from some other function:

a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)

Now to modify, I am trying a join with a condition.

a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")

But this code is not working as expected. Instead of the updated row, the join gives me this:

+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+

How can I fix this? The correct output should have 2 rows, and the second row should have the updated values.

1). An inner join will delete rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.

2). An == condition will duplicate columns; if your columns have the same names, you can pass a list of column names instead.

3). I imagine "some other condition" refers to bucket.

4). You want to keep the value from a_temp4 when it exists (the join will set its columns to null when it doesn't); psf.coalesce allows you to do this.

import pyspark.sql.functions as psf

# Left join on the key columns, then prefer a_temp4's values wherever they exist.
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
    psf.coalesce(a_temp4.category, a_df.category).alias("category"), 
    "category_id", 
    "bucket", 
    psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"), 
    psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"), 
    psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
    )

+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
|   state|      state|     2|       444|        555|             666|
|  nation|     nation|     1|       222|        444|             555|
+--------+-----------+------+----------+-----------+----------------+
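
The same pattern can be written once for all of a_temp4's non-key columns instead of spelling out each coalesce by hand. This is only a minimal sketch of that idea, assuming the join keys are category_id and bucket (keys and update_cols are illustrative names, not part of the answer above):

import pyspark.sql.functions as psf

keys = ["category_id", "bucket"]
# Every non-key column of a_temp4 is treated as an update over a_df.
update_cols = [c for c in a_temp4.columns if c not in keys]

a_df = a_df.join(a_temp4, keys, how="leftouter").select(
    keys + [psf.coalesce(a_temp4[c], a_df[c]).alias(c) for c in update_cols]
)
# Note: the key columns come first in the resulting column order.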

If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:

def update_col(category_id, bucket, col_name, col_val):
    # Replace col_name with col_val on the row matching (category_id, bucket),
    # and keep the existing value everywhere else.
    return psf.when((a_df.category_id == category_id) & (a_df.bucket == bucket), col_val).otherwise(a_df[col_name]).alias(col_name)

a_df.select(
    update_col("state", 2, "category", "nation"), 
    "category_id", 
    "bucket", 
    update_col("state", 2, "prop_count", 444), 
    update_col("state", 2, "event_count", 555), 
    update_col("state", 2, "accum_prop_count", 666)
).show()
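
When several columns change at once, the same psf.when pattern can be wrapped in a small helper. This is a hypothetical sketch; update_row and the updates dict are illustrative names, not part of the original answer:

import pyspark.sql.functions as psf

def update_row(df, category_id, bucket, updates):
    # Rebuild every column: for columns listed in `updates`, swap in the new value
    # on the row matching (category_id, bucket); keep the old value everywhere else.
    cond = (df.category_id == category_id) & (df.bucket == bucket)
    return df.select(*[
        psf.when(cond, updates[c]).otherwise(df[c]).alias(c) if c in updates else c
        for c in df.columns
    ])

a_df = update_row(a_df, "state", 2, {
    "category": "state",
    "prop_count": 444,
    "event_count": 555,
    "accum_prop_count": 666,
})
a_df.show()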
