Pyspark Dataframe - how to add multiple columns in dataframe, based on data in 2 columns
I have a PySpark dataframe requirement that I need to implement. Here is the scenario:
df1 schema:
root
|-- applianceName: string (nullable = true)
|-- customer: string (nullable = true)
|-- daysAgo: integer (nullable = true)
|-- countAnomaliesByDay: long (nullable = true)
Sample Data:
applianceName | customer | daysAgo | countAnomaliesByDay
app1          | cust1    | 0       | 100
app1          | cust1    | 1       | 200
app1          | cust1    | 2       | 300
app1          | cust1    | 3       | 400
app1          | cust1    | 4       | 500
app1          | cust1    | 5       | 600
app1          | cust1    | 6       | 700
To df1, I need to add the columns day0, day1, day2, day3, day4, day5, day6 as shown below:
applianceName | customer | day0 | day1 | day2 | day3 | day4 | day5 | day6
app1          | cust1    | 100  | 200  | 300  | 400  | 500  | 600  | 700
That is, column day0 holds countAnomaliesByDay where daysAgo = 0, column day1 holds countAnomaliesByDay where daysAgo = 1, and so on.
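The reshaping being asked for here is a pivot on daysAgo. As a minimal pure-Python sketch of the same transformation, using the sample values above and independent of Spark:

```python
# Pure-Python illustration of the pivot: collapse the rows into one
# (applianceName, customer) record, fanning countAnomaliesByDay out
# into dayN keys indexed by daysAgo.
rows = [
    ("app1", "cust1", 0, 100),
    ("app1", "cust1", 1, 200),
    ("app1", "cust1", 2, 300),
    ("app1", "cust1", 3, 400),
    ("app1", "cust1", 4, 500),
    ("app1", "cust1", 5, 600),
    ("app1", "cust1", 6, 700),
]

pivoted = {}
for appliance, customer, days_ago, count in rows:
    pivoted.setdefault((appliance, customer), {})[f"day{days_ago}"] = count

print(pivoted[("app1", "cust1")]["day0"])  # 100
```

This is exactly what Spark's pivot does at scale: group by the key columns, then turn each distinct daysAgo value into its own column.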
How do I achieve this?
TIA!
I hope this is useful for your solution. I used PySpark's pivot function to do this:
import findspark
findspark.init()
findspark.find()

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField

# create a Spark session
spark = SparkSession.builder.appName('StackOverflowMultiple').getOrCreate()

# schema matching df1: daysAgo is an integer per the schema in the question
fields = [
    StructField('applianceName', StringType(), True),
    StructField('customer', StringType(), True),
    StructField('daysAgo', IntegerType(), True),
    StructField('countAnomaliesByDay', IntegerType(), True)
]
finalStruct = StructType(fields=fields)

df = spark.read.csv('./pyspark_add_multiple_cols.csv', schema=finalStruct, header=True)

# aggregate per (applianceName, customer, daysAgo), then pivot daysAgo
# so each distinct value becomes its own column
df_pivot = df.groupBy('applianceName', 'customer', 'daysAgo') \
    .sum('countAnomaliesByDay') \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .sum('sum(countAnomaliesByDay)')
df_pivot.show(truncate=False)

# shorter alternative: pivot directly, taking one aggregate per cell
df_pivot = df.groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .max('countAnomaliesByDay') \
    .fillna(0)
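One detail worth noting: pivot('daysAgo') names the output columns after the pivot values ('0' through '6'), not 'day0' through 'day6' as requested. A small rename step fixes that; the list comprehension below is plain Python, and only the final toDF call is Spark-specific (a sketch assuming the df_pivot built above, whose column list you would normally read from df_pivot.columns):

```python
# Rename sketch: pivot output columns are named after the daysAgo values
# ("0".."6"); prefix the numeric ones with "day" to get day0..day6.
# In real code, take this list from df_pivot.columns instead of hard-coding it.
pivot_cols = ['applianceName', 'customer', '0', '1', '2', '3', '4', '5', '6']
renamed = [f'day{c}' if c.isdigit() else c for c in pivot_cols]
# Apply in Spark with: df_pivot = df_pivot.toDF(*renamed)
print(renamed)
```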