How to apply a filter on a column (with datatype array of strings) on a PySpark dataframe?
I have a PySpark dataframe:
df = spark.createDataFrame([
    ("u1", ['a', 'b']),
    ("u2", ['c', 'b']),
    ("u3", ['a', 'b']),
], ['user_id', 'features'])
df.printSchema()
df.show(truncate=False)
Output:
root
 |-- user_id: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: string (containsNull = true)
+-------+--------+
|user_id|features|
+-------+--------+
|u1 |[a, b] |
|u2 |[c, b] |
|u3 |[a, b] |
+-------+--------+
I want to keep only the rows whose features column equals [a, b]. Since the column is an array of strings, a simple equality filter against a plain value won't work.
How can I achieve this?
Expected output:
+-------+--------+
|user_id|features|
+-------+--------+
|u1 |[a, b] |
|u3 |[a, b] |
+-------+--------+
You can compare against an array literal built with array(lit(...)); sorting both sides with array_sort makes the match order-insensitive:
import pyspark.sql.functions as F
df2 = df.filter(F.array_sort(F.col('features')) == F.array_sort(F.array(F.lit('a'), F.lit('b'))))
df2.show()
+-------+--------+
|user_id|features|
+-------+--------+
| u1| [a, b]|
| u3| [a, b]|
+-------+--------+