
Pyspark Categorical data vectorization with numerical values associated with it

I'm a newbie in Pyspark programming. I need some help.

I have a dataset with a categorical feature and a numerical value associated with each category. I would like to vectorize the categorical values together with their associated numerical values. The categorical column has ~3 million possible values.

(image from the original question: a sample table with UserID, Fruit Purchased, and Quantity columns)

You can group by UserID and aggregate the Quantity column into an array:

import pyspark.sql.functions as F

df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))

But this may not ensure that the order of fruits remains correct. To achieve that, you can use a more sophisticated method that involves sorting:

# Note: array(`Fruit Purchased`, Quantity) coerces both elements to string
# (an array holds a single type), so the collected quantities come back as strings.
df2 = df.groupBy('UserID').agg(
    F.expr("transform(array_sort(collect_list(array(`Fruit Purchased`, Quantity))), x -> x[1]) Quantity")
)

Or you can do a pivot instead, which also ensures the order of fruits:

# Pivot columns appear in sorted order of the distinct fruit values.
df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
df3 = df2.select('UserID', F.array([c for c in df2.columns[1:]]).alias('Quantity'))
