PySpark DataFrame - how to add multiple columns to a DataFrame, based on data in 2 columns

I have a requirement for a PySpark DataFrame and need some input.

Here is the scenario:

df1 schema:

root
  |-- applianceName: string (nullable = true)
  |-- customer: string (nullable = true)
  |-- daysAgo: integer (nullable = true)
  |-- countAnomaliesByDay: long (nullable = true)

Sample Data:

applianceName | customer | daysAgo | countAnomaliesByDay
app1          | cust1    | 0       | 100
app1          | cust1    | 1       | 200
app1          | cust1    | 2       | 300
app1          | cust1    | 3       | 400
app1          | cust1    | 4       | 500
app1          | cust1    | 5       | 600
app1          | cust1    | 6       | 700

Starting from df1, I need to add the columns day0, day1, day2, day3, day4, day5 and day6, as shown below:

applianceName | customer | day0 | day1 | day2 | day3 | day4 | day5 | day6
app1          | cust1    | 100  | 200  | 300  | 400  | 500  | 600  | 700

That is, column day0 holds countAnomaliesByDay where daysAgo = 0, column day1 holds countAnomaliesByDay where daysAgo = 1, and so on.

How can I achieve this?

TIA!

I hope this is useful for your solution. I used PySpark's pivot function to do this:

import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, LongType, StringType, StructType, StructField

# Create a Spark session
spark = SparkSession.builder.appName('StackOverflowMultiple').getOrCreate()

# Schema matching df1 in the question
fields = [
    StructField('applianceName', StringType(), True),
    StructField('customer', StringType(), True),
    StructField('daysAgo', IntegerType(), True),
    StructField('countAnomaliesByDay', LongType(), True),
]
schema = StructType(fields=fields)
df = spark.read.csv('./pyspark_add_multiple_cols.csv', schema=schema, header=True)

# Sum the counts per (applianceName, customer, daysAgo), then pivot daysAgo into
# columns. The pivoted columns are named after the daysAgo values (0, 1, ..., 6).
df_pivot = df.groupBy('applianceName', 'customer', 'daysAgo') \
    .sum('countAnomaliesByDay') \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .sum('sum(countAnomaliesByDay)')
df_pivot.show(truncate=False)
# Alternative: pivot directly, keeping the max count per day and filling missing days with 0
df_pivot = df.groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .max('countAnomaliesByDay') \
    .fillna(0)
