[英]How to map a column in PySpark DataFrame and avoid getting Null values?
我有一個 PySpark DataFrame,我想映射一列的值。
示例數據集:
data = [(1, 'N'), \
(2, 'N'), \
(3, 'C'), \
(4, 'S'), \
(5, 'North'), \
(6, 'Central'), \
(7, 'Central'), \
(8, 'South')
]
columns = ["ID", "City"]
df = spark.createDataFrame(data = data, schema = columns)
映射字典是:
{'N': 'North', 'C': 'Central', 'S': 'South'}
我使用以下代碼:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
結果是:
如您所見,我得到了行的 Null 值,我沒有將它們的值包含在映射字典中。 為了解決這個問題,我可以通過以下方式定義映射字典:
{'N': 'North', 'C': 'Central', 'S': 'South', \
'North': 'North', 'Central': 'Central', 'South': 'South'}
但是,如果數據集中有很多唯一值,則很難定義映射字典。
有沒有更好的方法來達到這個目的?
你可以使用coalesce
。
這是它的樣子。
# create separate case whens for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
# Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
# Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass case whens to coalesce with last value as `city` field
data_sdf. \
withColumn('city_new', func.coalesce(*map_whens, 'city')). \
show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.