How to get unique values of every column in PySpark DataFrame and save the results in a DataFrame?
Suppose I have a Spark DataFrame like the following:
data = [
    ("A", "A", 1),
    ("A", "A", 2),
    ("A", "A", 3),
    ("A", "B", 4),
    ("A", "B", 5),
    ("A", "C", 6),
    ("A", "D", 7),
    ("A", "E", None),
]
columns = ["col_1", "col_2", "col_3"]
df = spark.createDataFrame(data=data, schema=columns)
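I want to get the list of unique entries for every column and save the results in a DataFrame. The output would be:
Column Name | Unique Values |
---|---|
col_1 | ['A'] |
col_2 | ['A', 'B', 'C', 'D', 'E'] |
col_3 | [1, 2, 3, 4, 5, 6, 7, None] |
Any idea how to do that?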
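One way to do this, given that the melt() function is not available:
from pyspark.sql import functions as F

# sdf is the Spark DataFrame from the question (called df there).
# Group all rows under one constant key and collect each column's distinct values.
sdf = sdf.withColumn("dummy", F.lit("1")) \
    .groupBy("dummy") \
    .agg(*[F.collect_set(c).alias(c) for c in sdf.columns]) \
    .drop("dummy")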
[Out]:
+-----+---------------+---------------------+
|col_1|col_2 |col_3 |
+-----+---------------+---------------------+
|[A] |[C, E, B, A, D]|[1, 5, 2, 6, 3, 7, 4]|
+-----+---------------+---------------------+
pdf = sdf.toPandas() \
.T \
.reset_index() \
.rename(columns={0: "Unique_Values", "index": "Column_Name"})
[Out]:
Column_Name Unique_Values
0 col_1 [A]
1 col_2 [C, E, B, A, D]
2 col_3 [1, 5, 2, 6, 3, 7, 4]
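If the end result should be a Spark DataFrame rather than a pandas one, the single aggregated row can also be reshaped without pandas. The sketch below is plain Python and only shows the reshaping step. It assumes the row has already been fetched as a dict (for example with sdf.first().asDict()), and to_long_rows is a hypothetical helper name, not part of PySpark.

```python
def to_long_rows(aggregated_row):
    """Turn {column: unique_values} into (Column_Name, Unique_Values) tuples."""
    return [(name, list(values)) for name, values in aggregated_row.items()]


# Stand-in for sdf.first().asDict() after the collect_set aggregation (assumed data).
aggregated_row = {
    "col_1": ["A"],
    "col_2": ["C", "E", "B", "A", "D"],
    "col_3": [1, 5, 2, 6, 3, 7, 4],
}

rows = to_long_rows(aggregated_row)
for column_name, unique_values in rows:
    print(column_name, unique_values)
```

The tuples have the same shape as the pandas result. Passing them to spark.createDataFrame would presumably need all lists to share one element type, so this probably fits best when every column has first been cast to string.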
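As you can see, None (null) is not included, because collect_set skips nulls. To include it, you need some extra processing: cast the columns to string type and replace nulls with a placeholder. If you want to keep the original types, you need to keep track of each column's type and convert the values back accordingly.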
for c in sdf.columns:
    sdf = sdf.withColumn(c, F.col(c).cast("string")).na.fill("_NULL_")
Then replace the placeholder afterwards:
pdf["Unique_Values"] = pdf["Unique_Values"].apply(lambda x: [None if v == "_NULL_" else v for v in x])
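For the case where the original types should be kept, one option is to record the column types before the string cast (sdf.dtypes returns pairs such as ("col_3", "bigint")) and convert the collected strings back afterwards. This is only a sketch in plain Python, operating on lists that have already been collected. CASTERS and restore_types are hypothetical names, and the mapping covers just a few type names as an assumption.

```python
# Hypothetical mapping from Spark type names (as reported by sdf.dtypes)
# to Python converters. Extend it for the types present in your data.
CASTERS = {
    "string": str,
    "int": int,
    "bigint": int,
    "float": float,
    "double": float,
}


def restore_types(values, spark_type, sentinel="_NULL_"):
    """Convert collected strings back to their original type; sentinel becomes None."""
    caster = CASTERS.get(spark_type, str)  # unknown types are left as strings
    return [None if v == sentinel else caster(v) for v in values]


# Assumed to have been captured with dict(sdf.dtypes) before the string cast.
original_types = {"col_1": "string", "col_2": "string", "col_3": "bigint"}

# Stand-in for the collected unique values after the string cast.
collected = {
    "col_1": ["A"],
    "col_2": ["C", "E", "B", "A", "D"],
    "col_3": ["3", "_NULL_", "1", "2"],
}

restored = {
    name: restore_types(values, original_types[name])
    for name, values in collected.items()
}
print(restored["col_3"])
```

With pandas, the same helper could be applied row by row, using Column_Name to look up the type.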
Full example:
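from pyspark.sql import functions as F

# Assumes an active SparkSession named spark, as in the question.
data = [
    ("A", "A", 1),
    ("A", "A", 2),
    ("A", "A", 3),
    ("A", "B", 4),
    ("A", "B", 5),
    ("A", "C", 6),
    ("A", "D", 7),
    ("A", "E", None),
]
columns = ["col_1", "col_2", "col_3"]
sdf = spark.createDataFrame(data=data, schema=columns)

# Cast every column to string and mark nulls with a placeholder so collect_set keeps them.
for c in sdf.columns:
    sdf = sdf.withColumn(c, F.col(c).cast("string")).na.fill("_NULL_")

# Collect the distinct values of each column into a single row of arrays.
sdf = sdf.withColumn("dummy", F.lit("1")) \
    .groupBy("dummy") \
    .agg(*[F.collect_set(c).alias(c) for c in sdf.columns]) \
    .drop("dummy")

# Transpose in pandas to get one row per original column.
pdf = sdf.toPandas() \
    .T \
    .reset_index() \
    .rename(columns={0: "Unique_Values", "index": "Column_Name"})

# Turn the placeholder back into None.
pdf["Unique_Values"] = pdf["Unique_Values"].apply(lambda x: [None if v == "_NULL_" else v for v in x])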
[Out]:
Column_Name Unique_Values
0 col_1 [A]
1 col_2 [C, E, B, A, D]
2 col_3 [3, None, 1, 2, 5, 4, 7, 6]
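The order of the values returned by collect_set is not guaranteed, whereas the output asked for in the question is sorted with the null at the end. A small plain-Python sketch for ordering each list after the conversion follows; sort_nulls_last is a hypothetical helper name.

```python
def sort_nulls_last(values):
    """Return the values sorted ascending, with any None placed at the end."""
    # The key's first element puts None after everything else, so None is
    # never compared directly against a real value.
    return sorted(values, key=lambda v: (v is None, v))


# Stand-in for one Unique_Values entry from the full example (strings after the cast).
unique_values = ["3", None, "1", "2", "5", "4", "7", "6"]
print(sort_nulls_last(unique_values))
```

With the pandas result it could be applied as pdf["Unique_Values"].apply(sort_nulls_last). Note that after the string cast the digits sort as text, so "10" would come before "2"; a numeric order needs the values converted back to numbers first.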