How to create columns based on the values of a DataFrame in PySpark

Question

I have a df that contains the values below, and I would like to create a column for each distinct value present in it. I am looking for a solution in PySpark. I know I could do this with a case when expression, but I am looking for a different approach. Any suggestions would be helpful.

DF:

|number|color|
|------|-----|
|123   |red  |
|234   |blue |
|555   |white|

Expected output:

|number|red|blue|white|
|------|---|----|-----|
|123   |1  |0   |0    |
|234   |0  |1   |0    |
|555   |0  |0   |1    |

Answer 1

You can group by number, pivot on color, and aggregate with lit(1). Each number only matches one color, so the other pivoted columns come back as null; to replace those nulls with 0, apply [dataframe].na.fill(0).

import pyspark.sql.functions as f

df = spark.createDataFrame([
  [123, 'red'],
  [234, 'blue'],
  [555, 'white']
], ['number', 'color'])

pivot_df = df.groupBy('number').pivot('color').agg(f.lit(1))
pivot_df = pivot_df.na.fill(0)

(pivot_df
 .select('number', 'red', 'blue', 'white')
 .sort('number')
 .show(truncate=False))
# +------+---+----+-----+
# |number|red|blue|white|
# +------+---+----+-----+
# |123   |1  |0   |0    |
# |234   |0  |1   |0    |
# |555   |0  |0   |1    |
# +------+---+----+-----+
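
To make it clearer why the na.fill(0) step is needed, here is a minimal plain-Python sketch of what the group-by, pivot, and fill steps compute. It is only an illustration of the logic, not how Spark implements it: the function name `pivot_one_hot` is hypothetical, rows are modeled as dicts, `None` stands in for Spark's null, and the new columns are sorted alphabetically just to keep the result deterministic.

```python
def pivot_one_hot(rows, key, pivot_col):
    """Illustrate groupBy(key).pivot(pivot_col).agg(lit(1)) followed by na.fill(0)."""
    # Every distinct value of the pivot column becomes a new column name.
    columns = sorted({row[pivot_col] for row in rows})

    pivoted = {}
    for row in rows:
        # One output row per key; cells start as None (null) until a match is seen.
        out = pivoted.setdefault(
            row[key], {key: row[key], **{c: None for c in columns}}
        )
        # The "aggregation" is the constant 1 for the matching value.
        out[row[pivot_col]] = 1

    # Equivalent of na.fill(0): replace the remaining nulls with 0.
    return [
        {name: (0 if value is None else value) for name, value in out.items()}
        for out in pivoted.values()
    ]


rows = [
    {"number": 123, "color": "red"},
    {"number": 234, "color": "blue"},
    {"number": 555, "color": "white"},
]

table = pivot_one_hot(rows, "number", "color")
for record in table:
    print(record)
```

In Spark itself, pivot also accepts an optional list of values as its second argument, for example pivot('color', ['red', 'blue', 'white']), which fixes the column order up front and avoids the extra pass Spark otherwise needs to find the distinct values.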
