How to create columns from values of a data frame in pyspark
I have a df that contains the values below, and I would like to create a column for each distinct value present in the df. I am looking for a solution in PySpark. Basically I could do this with a CASE WHEN in PySpark, but I am looking for a different approach. Any suggestions will be helpful.
DF:
|number|color|
|------|-----|
|123 |red |
|234 |blue |
|555 |white|
Expected output:

|number|red|blue|white|
|------|---|----|-----|
|123   |1  |0   |0    |
|234   |0  |1   |0    |
|555   |0  |0   |1    |
You can group by `number`, pivot by `color`, and apply `lit(1)` as the aggregated value. To replace the resulting null values with zeros, apply `[dataframe].na.fill(0)`.
import pyspark.sql.functions as f
df = spark.createDataFrame([
[123, 'red'],
[234, 'blue'],
[555, 'white']
], ['number', 'color'])
pivot_df = df.groupBy('number').pivot('color').agg(f.lit(1))
pivot_df = pivot_df.na.fill(0)
(pivot_df
.select('number', 'red', 'blue', 'white')
.sort('number')
.show(truncate=False))
# +------+---+----+-----+
# |number|red|blue|white|
# +------+---+----+-----+
# |123 |1 |0 |0 |
# |234 |0 |1 |0 |
# |555 |0 |0 |1 |
# +------+---+----+-----+
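To see why this works, the pivot is doing a one-hot encoding: each distinct `color` becomes a column holding 1 where the row had that color (from `lit(1)`) and null elsewhere, which `na.fill(0)` turns into 0. The same mapping can be sketched in plain Python (`one_hot` is a hypothetical helper for illustration, not part of PySpark):

```python
def one_hot(rows, key, category):
    """Return one dict per row with a 0/1 column for each distinct category value,
    mirroring groupBy(key).pivot(category).agg(lit(1)) followed by na.fill(0)."""
    categories = sorted({row[category] for row in rows})
    out = []
    for row in rows:
        encoded = {key: row[key]}
        for c in categories:
            # 1 where the pivot would place lit(1), 0 where na.fill(0) fills the null
            encoded[c] = 1 if row[category] == c else 0
        out.append(encoded)
    return out

rows = [
    {'number': 123, 'color': 'red'},
    {'number': 234, 'color': 'blue'},
    {'number': 555, 'color': 'white'},
]

for r in one_hot(rows, 'number', 'color'):
    print(r)
# {'number': 123, 'blue': 0, 'red': 1, 'white': 0}
# {'number': 234, 'blue': 1, 'red': 0, 'white': 0}
# {'number': 555, 'blue': 0, 'red': 0, 'white': 1}
```

Note that, unlike this sketch, `pivot` infers the column order from the data, which is why the PySpark answer uses an explicit `select` to fix the column order before showing the result.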