Removing null values from Array after merging columns - pyspark
I have this PySpark DataFrame df:
+---------+----+----+----+----+----+----+----+----+----+
|partition| 1| 2| 3| 4| 5| 6| 7| 8| 9|
+---------+----+----+----+----+----+----+----+----+----+
| 7|null|null|null|null|null|null| 0.7|null|null|
| 1| 0.2| 0.1| 0.3|null|null|null|null|null|null|
| 8|null|null|null|null|null|null|null| 0.8|null|
| 4|null|null|null| 0.4| 0.5| 0.6|null|null| 0.9|
+---------+----+----+----+----+----+----+----+----+----+
I merged the value columns into an array:
+---------+--------------------+
|partition| vec_comb|
+---------+--------------------+
| 7| [,,,,,,,, 0.7]|
| 1|[,,,,,, 0.1, 0.2,...|
| 8| [,,,,,,,, 0.8]|
| 4|[,,,,, 0.4, 0.5, ...|
+---------+--------------------+
How can I remove the NullTypes from the arrays in the vec_comb column?
Expected output:
+---------+--------------------+
|partition| vec_comb|
+---------+--------------------+
| 7| [0.7]|
|        1|     [0.1, 0.2, 0.3]|
|        8|               [0.8]|
|        4|[0.4, 0.5, 0.6, 0.9]|
+---------+--------------------+
I have already tried this (obviously wrong, but I can't wrap my head around the problem):
def clean_vec(array):
    new_Array = []
    for element in array:
        if type(element) == FloatType():
            new_Array.append(element)
    return new_Array

udf_clean_vec = F.udf(f=(lambda c: clean_vec(c)), returnType=ArrayType(FloatType()))
df = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))
You can use the higher-order function filter (available since Spark 2.4) to remove the null elements:
import pyspark.sql.functions as F
df2 = df.withColumn('vec_comb_cleaned', F.expr('filter(vec_comb, x -> x is not null)'))
df2.show()
+---------+--------------------+--------------------+
|partition| vec_comb| vec_comb_cleaned|
+---------+--------------------+--------------------+
| 7| [,,,,,, 0.7,,]| [0.7]|
| 1|[0.2, 0.1, 0.3,,,...| [0.2, 0.1, 0.3]|
| 8| [,,,,,,, 0.8,]| [0.8]|
| 4|[,,, 0.4, 0.5, 0....|[0.4, 0.5, 0.6, 0.9]|
+---------+--------------------+--------------------+
Alternatively you can use a UDF, but it will be slower, e.g.:
udf_clean_vec = F.udf(lambda x: [i for i in x if i is not None], 'array<float>')
df2 = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))
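Note why the asker's attempt failed: `type(element) == FloatType()` compares a Python float's type against a Spark DataType instance, which is never true, so every element is dropped. The `i is not None` test is what the UDF needs. Since the lambda inside the UDF is plain Python, its logic can be sanity-checked without a Spark session:

```python
def clean_vec(array):
    """Keep only the non-null elements, mirroring the UDF's lambda."""
    return [i for i in array if i is not None]

print(clean_vec([None, None, 0.7, None]))  # [0.7]
print(clean_vec([0.2, 0.1, 0.3, None]))    # [0.2, 0.1, 0.3]
```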
Without using PySpark-specific functionality, you can also create a list by filtering out the NaN values (a pandas approach):
df['vec_comb'] = df.iloc[:, 1:10].apply(lambda r: list(filter(pd.notna, r)), axis=1)
df
# Output:
partition 1 2 3 4 5 6 7 8 9 vec_comb
0 7 NaN NaN NaN NaN NaN NaN 0.7 NaN NaN [0.7]
1 1 0.2 0.1 0.3 NaN NaN NaN NaN NaN NaN [0.2, 0.1, 0.3]
2 8 NaN NaN NaN NaN NaN NaN NaN 0.8 NaN [0.8]
3 4 NaN NaN NaN 0.4 0.5 0.6 NaN NaN 0.9 [0.4, 0.5, 0.6, 0.9]
Then drop the old columns by selecting only the two you want:
df = df[['partition', 'vec_comb']]
df
# Output:
partition vec_comb
0 7 [0.7]
1 1 [0.2, 0.1, 0.3]
2 8 [0.8]
3 4 [0.4, 0.5, 0.6, 0.9]
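The pandas steps above can be run end to end as a self-contained sketch (the frame is rebuilt here from the values shown in the question):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[7, nan, nan, nan, nan, nan, nan, 0.7, nan, nan],
     [1, 0.2, 0.1, 0.3, nan, nan, nan, nan, nan, nan],
     [8, nan, nan, nan, nan, nan, nan, nan, 0.8, nan],
     [4, nan, nan, nan, 0.4, 0.5, 0.6, nan, nan, 0.9]],
    columns=["partition"] + [str(i) for i in range(1, 10)],
)

# Build the list column by dropping NaN cells row-wise, then keep only
# the two columns of interest.
df["vec_comb"] = df.iloc[:, 1:10].apply(lambda r: list(filter(pd.notna, r)), axis=1)
df = df[["partition", "vec_comb"]]
print(df)
```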