
Pyspark Categorical data vectorization with numerical values associated with it

I'm a newbie in Pyspark programming. I need some help.

I have a dataset with a categorical feature and a numerical value associated with each category. I would like to vectorize the categorical values together with their associated numerical values. The categorical column has ~3 million possible values.

(image from the original question: a sample table with UserID, Fruit Purchased, and Quantity columns)

You can group by UserID and aggregate the Quantity column into an array:

import pyspark.sql.functions as F

df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))

But this may not ensure that the order of fruits remains correct. To achieve that, you can use a more sophisticated method that involves sorting:

# Note: array(`Fruit Purchased`, Quantity) coerces both elements to string
# (an array holds a single type), so the collected quantities come back as strings.
df2 = df.groupBy('UserID').agg(
    F.expr("transform(array_sort(collect_list(array(`Fruit Purchased`, Quantity))), x -> x[1]) Quantity")
)

Or you can do a pivot instead, which also ensures the order of fruits:

# Pivot columns appear in sorted order of the distinct fruit values.
df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
df3 = df2.select('UserID', F.array([c for c in df2.columns[1:]]).alias('Quantity'))
