
PySpark get_dummies equivalent

I have a pyspark dataframe with the following schema:

Key1  Key2  Key3  Value
a     a     a     "value1"
a     a     a     "value2"
a     a     b     "value1"
b     b     a     "value2"

(In real life this dataframe is extremely large, so it is not reasonable to convert it to a pandas DataFrame.)

My goal is to transform the dataframe to look like this:

Key1  Key2  Key3  value1  value2
a     a     a     1       1
a     a     b     1       0
b     b     a     0       1

I know this is possible in pandas using the get_dummies function, and I have also seen that there is some sort of pyspark & pandas hybrid function that I am not sure I can use.
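For reference, the "hybrid" mentioned above is presumably the pandas-on-Spark API (pyspark.pandas, available since Spark 3.2). A minimal sketch, assuming df holds the data above; note that get_dummies works row-wise, so a grouped aggregation is still needed afterwards:

import pyspark.pandas as ps

psdf = df.pandas_api()  # Spark DataFrame -> pandas-on-Spark DataFrame
# one indicator column per distinct Value; empty prefix keeps the names value1/value2
dummies = ps.get_dummies(psdf, columns=['Value'], prefix='', prefix_sep='')
result = dummies.groupby(['Key1', 'Key2', 'Key3']).max().reset_index()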

It is worth mentioning that the Value column can receive (in this example) only the values "value1" and "value2". I have encountered this question, which possibly solves my problem, but I do not entirely understand it and was wondering if there is a simpler way to solve the problem.
Any help is greatly appreciated.

SMALL EDIT

After implementing the accepted solution, to turn this into a one-hot encoding and not just a count of occurrences, I converted each column to boolean type and then back to integer, as sketched below.
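A minimal sketch of that cast, assuming the pivoted dataframe produced by the accepted answer below (dummy columns named value1 and value2):

from pyspark.sql import functions as F

# any non-zero count casts to true and back to 1; zero stays 0
for c in ['value1', 'value2']:
    df = df.withColumn(c, F.col(c).cast('boolean').cast('int'))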

This can be achieved by grouping twice: the first groupby counts duplicate rows, and the second pivots the Value column into one column per distinct value.

from pyspark.sql import functions as F

df = df.groupby(*df.columns).agg(F.count('*').alias('cnt')) \
    .groupby('Key1', 'Key2', 'Key3').pivot('Value').agg(F.sum('cnt')).fillna(0)
df.show(truncate=False)

# +----+----+----+------+------+
# |Key1|Key2|Key3|value1|value2|
# +----+----+----+------+------+
# |a   |a   |b   |1     |0     |
# |b   |b   |a   |0     |1     |
# |a   |a   |a   |1     |1     |
# +----+----+----+------+------+

You can group by the key columns and pivot the value column while counting all records.

from pyspark.sql import functions as func

data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val'). \
    agg(func.count('*')). \
    fillna(0). \
    show()

# +----+----+----+------+------+
# |key1|key2|key3|value1|value2|
# +----+----+----+------+------+
# |   b|   b|   a|     0|     1|
# |   a|   a|   a|     1|     1|
# |   a|   a|   b|     1|     0|
# +----+----+----+------+------+
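Since the question notes that Value can only hold "value1" and "value2", you can also pass the list of values to pivot explicitly; this spares Spark an extra pass over the data to discover the distinct values (a minor variation on the snippet above, same data_sdf assumed):

data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val', ['value1', 'value2']). \
    agg(func.count('*')). \
    fillna(0). \
    show()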
