Add new column in Pyspark dataframe based on where condition on other column
I have a PySpark dataframe like the one below:
+------------+--------------+-------------------+
| package_id | location     | package_scan_code |
+------------+--------------+-------------------+
| 123        | Denver       | 05                |
| 123        | LosAngeles   | 03                |
| 123        | Dallas       | 09                |
| 123        | Vail         | 02                |
| 456        | Jacksonville | 05                |
| 456        | Nashville    | 09                |
| 456        | Memphis      | 03                |
+------------+--------------+-------------------+
A package_scan_code of 03 indicates the origin of the package.
I want to add a column origin to this dataframe so that, for every package (identified by package_id), the value in the new origin column is the location whose package_scan_code is 03.
In the case above there are two unique packages, 123 and 456, and their origins are LosAngeles and Memphis respectively (the locations corresponding to package_scan_code 03).
So I want the output to look like this:
+------------+--------------+-------------------+------------+
| package_id | location     | package_scan_code | origin     |
+------------+--------------+-------------------+------------+
| 123        | Denver       | 05                | LosAngeles |
| 123        | LosAngeles   | 03                | LosAngeles |
| 123        | Dallas       | 09                | LosAngeles |
| 123        | Vail         | 02                | LosAngeles |
| 456        | Jacksonville | 05                | Memphis    |
| 456        | Nashville    | 09                | Memphis    |
| 456        | Memphis      | 03                | Memphis    |
+------------+--------------+-------------------+------------+
How can I achieve this in PySpark? I tried the .withColumn method but could not get the condition right.
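For reference, the input dataframe can be reproduced with a snippet like the one below (assuming an active SparkSession obtained via getOrCreate):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [(123,'Denver','05'),(123,'LosAngeles','03'),(123,'Dallas','09'),(123,'Vail','02'),
        (456,'Jacksonville','05'),(456,'Nashville','09'),(456,'Memphis','03')]
df = spark.createDataFrame(rows, ['package_id','location','package_scan_code'])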
Filter the dataframe on package_scan_code == '03', then join the result back to the original dataframe:
(df.filter(df.package_scan_code == '03')           # keep only the origin rows
   .selectExpr('package_id', 'location as origin') # rename location to origin
   .join(df, ['package_id'], how='right')          # attach origin to every row
   .show())
+----------+----------+------------+-----------------+
|package_id| origin| location|package_scan_code|
+----------+----------+------------+-----------------+
| 123|LosAngeles| Denver| 05|
| 123|LosAngeles| LosAngeles| 03|
| 123|LosAngeles| Dallas| 09|
| 123|LosAngeles| Vail| 02|
| 456| Memphis|Jacksonville| 05|
| 456| Memphis| Nashville| 09|
| 456| Memphis| Memphis| 03|
+----------+----------+------------+-----------------+
Note: this assumes you have at most one package_scan_code equal to 03 per package_id; otherwise the logic is not correct and you need to rethink how origin should be defined.
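If duplicate 03 scans are possible, a minimal guard is to deduplicate before joining; this is only a sketch, and it assumes that keeping any one matching location per package is acceptable:

(df.filter(df.package_scan_code == '03')
   .selectExpr('package_id', 'location as origin')
   .dropDuplicates(['package_id'])  # assumption: any one origin row per package will do
   .join(df, ['package_id'], how='right')
   .show())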
This code should work regardless of how many times package_scan_code=03 occurs per package_id in the dataframe. I added an extra row (123,'LosAngeles','03') to demonstrate this.
Step 1: Create the dataframe
values = [(123,'Denver','05'),(123,'LosAngeles','03'),(123,'Dallas','09'),(123,'Vail','02'),(123,'LosAngeles','03'),
          (456,'Jacksonville','05'),(456,'Nashville','09'),(456,'Memphis','03')]
# `spark` is the active SparkSession (`sqlContext` also works on older versions)
df = spark.createDataFrame(values, ['package_id','location','package_scan_code'])
Step 2: Create a dictionary of package_id and location.
from pyspark.sql.functions import col

# Keep only origin rows; groupby collapses repeated (package_id, location)
# pairs, so the collected dictionary has one entry per package.
df_count = df.where(col('package_scan_code') == '03').groupby('package_id', 'location').count()
dict_location_scan_code = dict(df_count.rdd.map(lambda x: (x['package_id'], x['location'])).collect())
print(dict_location_scan_code)
{456: 'Memphis', 123: 'LosAngeles'}
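As an aside, the count column produced by groupby is never used here; a sketch that builds the same dictionary via distinct() (assuming only the distinct pairs matter) would be:

pairs = df.where(col('package_scan_code') == '03').select('package_id', 'location').distinct()
dict_location_scan_code = dict(pairs.rdd.map(lambda r: (r['package_id'], r['location'])).collect())

Either way, collect() pulls the mapping to the driver, so this approach assumes the number of distinct packages is small.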
Step 3: Create a column by mapping package_id through the dictionary.
from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# Turn the dict into a literal map expression and look up each package_id.
mapping_expr = create_map([lit(x) for x in chain(*dict_location_scan_code.items())])
df = df.withColumn('origin', mapping_expr.getItem(col('package_id')))
df.show()
+----------+------------+-----------------+----------+
|package_id| location|package_scan_code| origin|
+----------+------------+-----------------+----------+
| 123| Denver| 05|LosAngeles|
| 123| LosAngeles| 03|LosAngeles|
| 123| Dallas| 09|LosAngeles|
| 123| Vail| 02|LosAngeles|
| 123| LosAngeles| 03|LosAngeles|
| 456|Jacksonville| 05| Memphis|
| 456| Nashville| 09| Memphis|
| 456| Memphis| 03| Memphis|
+----------+------------+-----------------+----------+
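If there are many distinct package_ids, the literal map embedded in the query plan can get unwieldy. A join-based alternative to steps 2 and 3 (a sketch, assuming the origins table is small enough to broadcast; df is the step 1 dataframe) avoids collecting to the driver:

from pyspark.sql.functions import broadcast, col

# One origin row per package, then broadcast-join it onto every row.
origins = (df.where(col('package_scan_code') == '03')
             .select('package_id', col('location').alias('origin'))
             .dropDuplicates(['package_id']))
df_with_origin = df.join(broadcast(origins), 'package_id', 'left')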