How to modify cell values based on a condition in a PySpark dataframe
I have a dataframe which has a few columns like below:
category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
--------------------------------------------------------------------------------------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677
This dataframe starts with 0 rows, and each function in my script appends a row to it.
One function needs to modify one or two cell values based on a condition. How can I do this?

Code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Integer columns need IntegerType; with StringType, createDataFrame rejects the int literals below
schema = StructType([
    StructField("category", StringType()),
    StructField("category_id", StringType()),
    StructField("bucket", IntegerType()),
    StructField("prop_count", IntegerType()),
    StructField("event_count", IntegerType()),
    StructField("accum_prop_count", IntegerType())
])
a_df = sqlContext.createDataFrame([], schema)
a_temp = sqlContext.createDataFrame([("nation", "nation", 1, 222, 444, 555)], schema)
a_df = a_df.unionAll(a_temp)
Rows added from another function:
a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)
Now, to modify a row, I am trying a join with a condition:
a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")
But this code is not working; instead of an updated row, I get this output:
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
How can I fix this? The correct output should have 2 rows, and the second row should have the updated values.
1). An inner join will delete rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.

2). An == join condition will duplicate columns when they have the same names; you can pass a list of column names instead.

3). I imagine "some other condition" refers to bucket.

4). You want to keep the value from a_temp4 wherever it exists (the join sets its columns to null where it doesn't); psf.coalesce lets you do exactly that:
import pyspark.sql.functions as psf
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
psf.coalesce(a_temp4.category, a_df.category).alias("category"),
"category_id",
"bucket",
psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"),
psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"),
psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
)
+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
| state| state| 2| 444| 555| 666|
| nation| nation| 1| 222| 444| 555|
+--------+-----------+------+----------+-----------+----------------+
If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:
def update_col(category_id, bucket, col_name, col_val):
    return psf.when(
        (a_df.category_id == category_id) & (a_df.bucket == bucket), col_val
    ).otherwise(a_df[col_name]).alias(col_name)
a_df.select(
update_col("state", 2, "category", "state"),
"category_id",
"bucket",
update_col("state", 2, "prop_count", 444),
update_col("state", 2, "event_count", 555),
update_col("state", 2, "accum_prop_count", 666)
).show()