Create Pandas data frame with statistics from PySpark data frame
I have a large PySpark data frame that looks like this:
from pyspark.sql.functions import col, to_timestamp

# assumes an active SparkSession named `spark` (e.g. from the pyspark shell)
data = [('2010-09-12 0', 'x1', 13),
        ('2010-09-12 0', 'x2', 12),
        ('2010-09-12 2', 'x3', 23),
        ('2010-09-12 4', 'x1', 22),
        ('2010-09-12 4', 'x2', 32),
        ('2010-09-12 4', 'x3', 7),
        ('2010-09-12 6', 'x3', 24),
        ('2010-09-12 16', 'x3', 34)]
columns = ['timestamp', 'category', 'value']
df = spark.createDataFrame(data=data, schema=columns)
df = df.withColumn('ts', to_timestamp(col('timestamp'), 'yyyy-MM-dd H')).drop(col('timestamp'))
df.show()
+--------+-----+-------------------+
|category|value| ts|
+--------+-----+-------------------+
| x1| 13|2010-09-12 00:00:00|
| x2| 12|2010-09-12 00:00:00|
| x3| 23|2010-09-12 02:00:00|
| x1| 22|2010-09-12 04:00:00|
| x2| 32|2010-09-12 04:00:00|
| x3| 7|2010-09-12 04:00:00|
| x3| 24|2010-09-12 06:00:00|
| x3| 34|2010-09-12 16:00:00|
+--------+-----+-------------------+
The timestamps in the ts column increase in 2-hour steps (e.g. 0, 2, ..., 22). I want to extract the average, min, max, and median of value for each ts timestamp, and put these statistics into a pandas data frame, like so:
import pandas as pd
import datetime

# desired time range for the statistics
start_ts = datetime.datetime(year=2010, month=2, day=1, hour=0)
end_ts = datetime.datetime(year=2022, month=6, day=1, hour=22)
ts average min max median
...
2010-09-12 00:00:00 12.5 12 13 12.5
2010-09-12 02:00:00 23 23 23 23
2010-09-12 04:00:00 20.3 7 32 22
2010-09-12 06:00:00 24 24 24 24
2010-09-12 16:00:00 34 34 34 34
...
What is an economical way to do this that minimizes the number of passes over the pyspark data frame?
Aggregate, then convert the result to pandas:
from pyspark.sql import functions as F

df1 = df.groupby("ts").agg(
    F.avg("value").alias("average"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
    F.percentile_approx("value", 0.5).alias("median")
)
result = df1.toPandas()
# ts average min max median
# 0 2010-09-12 00:00:00 12.500000 12 13 12
# 1 2010-09-12 02:00:00 23.000000 23 23 23
# 2 2010-09-12 04:00:00 20.333333 7 32 22
# 3 2010-09-12 06:00:00 24.000000 24 24 24
# 4 2010-09-12 16:00:00 34.000000 34 34 34
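Note that percentile_approx returns a value that actually occurs in the group, which is why the median of the first group ([12, 13]) comes back as 12 rather than the interpolated 12.5 shown in the desired output. A plain-Python sketch of the difference, using the standard statistics module rather than Spark:

```python
import statistics

group = [12, 13]  # the two `value`s at ts = 2010-09-12 00:00:00

# An exact median interpolates between the two middle elements...
exact = statistics.median(group)       # 12.5

# ...while picking an element of the data itself, as percentile_approx
# does, yields the lower of the two middle elements here.
approx = statistics.median_low(group)  # 12

print(exact, approx)
```

If the interpolated value is required, use an exact percentile as in the next answer.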
The following computes the exact median instead, though you should avoid exact medians on very large groups. Also, the data can be filtered without the datetime module:
from pyspark.sql import functions as F

df = (df
      .filter(F.col('ts').between('2010-02-01', '2022-06-01'))
      .groupBy('ts').agg(
          F.round(F.mean('value'), 1).alias('average'),
          F.min('value').alias('min'),
          F.max('value').alias('max'),
          F.expr('percentile(value, .5)').alias('median'),
      )
)
pdf = df.toPandas()
print(pdf)
# ts average min max median
# 0 2010-09-12 02:00:00 23.0 23 23 23.0
# 1 2010-09-12 00:00:00 12.5 12 13 12.5
# 2 2010-09-12 06:00:00 24.0 24 24 24.0
# 3 2010-09-12 16:00:00 34.0 34 34 34.0
# 4 2010-09-12 04:00:00 20.3 7 32 22.0
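As the printed output shows, groupBy makes no guarantee about row order, so if the pandas frame should be chronological, sort it after conversion. A minimal sketch, rebuilding the result frame by hand with the values from the output above (no Spark session is assumed here):

```python
import pandas as pd

# Hypothetical stand-in for `df.toPandas()`, with rows in the
# arbitrary order that groupBy produced above.
pdf = pd.DataFrame({
    'ts': pd.to_datetime(['2010-09-12 02:00:00', '2010-09-12 00:00:00',
                          '2010-09-12 06:00:00', '2010-09-12 16:00:00',
                          '2010-09-12 04:00:00']),
    'average': [23.0, 12.5, 24.0, 34.0, 20.3],
    'min': [23, 12, 24, 34, 7],
    'max': [23, 13, 24, 34, 32],
    'median': [23.0, 12.5, 24.0, 34.0, 22.0],
})

# Sort chronologically and renumber the index.
pdf = pdf.sort_values('ts').reset_index(drop=True)
print(pdf)
```

Alternatively, call .orderBy('ts') on the Spark frame before toPandas(), at the cost of a sort on the Spark side.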