How to modify a cell's value based on a condition in a PySpark dataframe

I have a dataframe which has a few columns like below:

category| category_id|    bucket| prop_count| event_count |   accum_prop_count |  accum_event_count
-----------------------------------------------------------------------------------------------------
nation  |   nation     |    1     | 222       |     444     |   555              |  6677

This dataframe starts with 0 rows, and each function in my script adds a row to it.

There is a function which needs to modify one or two cell values based on a condition. How can I do this?

Code:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Numeric columns are typed as integers so the rows below match the schema.
schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()),
                     StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()),
                     StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
a_df = sqlContext.createDataFrame([], schema)

a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)

Rows added from some other function:

a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)

Now to modify, I am trying a join with a condition.

a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")

But this code is not working as expected. Instead of the updated row, the join gives me this:

+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+

How can I fix this? The correct output should have 2 rows, and the second row should have the updated values.

1). An inner join will delete rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.

2). An == condition will duplicate columns; if your columns have the same names, you can pass a list of column names instead.

3). I imagine "some other condition" refers to bucket.

4). You want to keep the value from a_temp4 when it exists (the join will set its columns to null when it doesn't); psf.coalesce allows you to do this.

import pyspark.sql.functions as psf

# Left join on the key columns, then prefer a_temp4's values wherever they exist.
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
    psf.coalesce(a_temp4.category, a_df.category).alias("category"), 
    "category_id", 
    "bucket", 
    psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"), 
    psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"), 
    psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
    )

+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
|   state|      state|     2|       444|        555|             666|
|  nation|     nation|     1|       222|        444|             555|
+--------+-----------+------+----------+-----------+----------------+
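
The same pattern can be written once for all of a_temp4's non-key columns instead of spelling out each coalesce by hand. This is only a minimal sketch of that idea, assuming the join keys are category_id and bucket (keys and update_cols are illustrative names, not part of the answer above):

import pyspark.sql.functions as psf

keys = ["category_id", "bucket"]
# Every non-key column of a_temp4 is treated as an update over a_df.
update_cols = [c for c in a_temp4.columns if c not in keys]

a_df = a_df.join(a_temp4, keys, how="leftouter").select(
    keys + [psf.coalesce(a_temp4[c], a_df[c]).alias(c) for c in update_cols]
)
# Note: the key columns come first in the resulting column order.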

If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:

def update_col(category_id, bucket, col_name, col_val):
    # Replace col_name with col_val on the row matching (category_id, bucket),
    # and keep the existing value everywhere else.
    return psf.when((a_df.category_id == category_id) & (a_df.bucket == bucket), col_val).otherwise(a_df[col_name]).alias(col_name)

a_df.select(
    update_col("state", 2, "category", "nation"), 
    "category_id", 
    "bucket", 
    update_col("state", 2, "prop_count", 444), 
    update_col("state", 2, "event_count", 555), 
    update_col("state", 2, "accum_prop_count", 666)
).show()
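
When several columns change at once, the same psf.when pattern can be wrapped in a small helper. This is a hypothetical sketch; update_row and the updates dict are illustrative names, not part of the original answer:

import pyspark.sql.functions as psf

def update_row(df, category_id, bucket, updates):
    # Rebuild every column: for columns listed in `updates`, swap in the new value
    # on the row matching (category_id, bucket); keep the old value everywhere else.
    cond = (df.category_id == category_id) & (df.bucket == bucket)
    return df.select(*[
        psf.when(cond, updates[c]).otherwise(df[c]).alias(c) if c in updates else c
        for c in df.columns
    ])

a_df = update_row(a_df, "state", 2, {
    "category": "state",
    "prop_count": 444,
    "event_count": 555,
    "accum_prop_count": 666,
})
a_df.show()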
