pyspark dataframe groupby with aggregate unique values
I am looking for a PySpark equivalent of the pandas expression df.groupby('upc')['store'].unique(),
where df is any pandas dataframe.
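For reference, here is a minimal sketch of what that pandas call produces; the data and column names mirror the PySpark example defined below, and the pandas import is assumed.
import pandas as pd

pdf = pd.DataFrame(
    {"upc": ["36636", "40288", "42114", "39192", "39192"],
     "store": ["M", "M", "M", "F", "F"],
     "sale": [3000, 4000, 3000, 4000, 2000]}
)

# Returns a Series indexed by upc whose values are arrays of unique stores,
# e.g. upc 39192 -> array(['F'], dtype=object)
unique_stores = pdf.groupby("upc")["store"].unique()
print(unique_stores)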
Please use this code to create the dataframe in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data2 = [("36636", "M", 3000),
         ("40288", "M", 4000),
         ("42114", "M", 3000),
         ("39192", "F", 4000),
         ("39192", "F", 2000)]

schema = StructType([
    StructField("upc", StringType(), True),
    StructField("store", StringType(), True),
    StructField("sale", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
I know about pyspark groupby unique_count, but I need help with unique_values.
You can apply the collect_set aggregation to collect the unique values of a column. Note that collect_set ignores null values.
df.groupBy("upc").agg(F.collect_set("store")).show()
+-----+------------------+
| upc|collect_set(store)|
+-----+------------------+
|42114| [M]|
|40288| [M]|
|39192| [F]|
|36636| [M]|
+-----+------------------+
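To illustrate the note about nulls, here is a minimal sketch; it reuses the spark session, schema, and F from the question above, and the extra row with a None store is purely hypothetical.
# Hypothetical data to show that collect_set drops null values.
data_with_null = [("36636", "M", 3000),
                  ("36636", None, 1000)]
df_null = spark.createDataFrame(data=data_with_null, schema=schema)

# The null store is ignored, so upc 36636 yields [M], not [M, null].
df_null.groupBy("upc").agg(F.collect_set("store")).show()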
You can use collect_set to get the unique values:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
df_group = df.groupBy('upc').agg(F.collect_set(col('store')))
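As a usage note, continuing from the snippet above, you can alias the aggregated column and, if you want a driver-side result similar to the pandas Series, collect it into a Python dict; the alias name 'stores' is just an illustrative choice.
# Rename the aggregated column for readability (alias name is arbitrary).
df_group = df.groupBy('upc').agg(F.collect_set(col('store')).alias('stores'))
df_group.show()

# Optionally bring the result back to the driver as a dict of upc -> unique stores,
# which mirrors what pandas' groupby(...)['store'].unique() gives per group.
unique_stores = {row['upc']: row['stores'] for row in df_group.collect()}
print(unique_stores)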