
Create Pandas data frame with statistics from PySpark data frame

I have a large PySpark data frame that looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

# Create (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()

data = [('2010-09-12 0', 'x1', 13), 
        ('2010-09-12 0', 'x2', 12), 
        ('2010-09-12 2', 'x3', 23), 
        ('2010-09-12 4', 'x1', 22), 
        ('2010-09-12 4', 'x2', 32), 
        ('2010-09-12 4', 'x3', 7), 
        ('2010-09-12 6', 'x3', 24),
        ('2010-09-12 16', 'x3', 34),]

columns = ['timestamp', 'category', 'value']
df = spark.createDataFrame(data=data, schema=columns)
df = df.withColumn('ts', to_timestamp(col('timestamp'), 'yyyy-MM-dd H')).drop(col('timestamp'))
df.show()

+--------+-----+-------------------+
|category|value|                 ts|
+--------+-----+-------------------+
|      x1|   13|2010-09-12 00:00:00|
|      x2|   12|2010-09-12 00:00:00|
|      x3|   23|2010-09-12 02:00:00|
|      x1|   22|2010-09-12 04:00:00|
|      x2|   32|2010-09-12 04:00:00|
|      x3|    7|2010-09-12 04:00:00|
|      x3|   24|2010-09-12 06:00:00|
|      x3|   34|2010-09-12 16:00:00|
+--------+-----+-------------------+

The timestamps in the ts column increase in 2-hour steps (e.g. 0, 2, ..., 22).

I would like to extract the average, min, max and median of the value column for each ts timestamp, and put these statistics into a pandas data frame, like below:

import pandas as pd
import datetime

start_ts = datetime.datetime(year=2010, month=2, day=1, hour=0)
end_ts = datetime.datetime(year=2022, month=6, day=1, hour=22)

ts                      average   min    max   median
...
2010-09-12 00:00:00     12.5      12     13    12.5
2010-09-12 02:00:00     23        23     23    23
2010-09-12 04:00:00     20.3      7      32    22
2010-09-12 06:00:00     24        24     24    24
2010-09-12 16:00:00     34        34     34    34
...

What is an economical way to do this, minimizing the number of iterations over the pyspark data frame?

Aggregate, then convert the result to pandas:

from pyspark.sql import functions as F

df1 = df.groupby("ts").agg(
    F.avg("value").alias("average"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
    F.percentile_approx("value", 0.5).alias("median")
)

result = df1.toPandas()

#                    ts    average  min  max  median
# 0 2010-09-12 00:00:00  12.500000   12   13      12
# 1 2010-09-12 02:00:00  23.000000   23   23      23
# 2 2010-09-12 04:00:00  20.333333    7   32      22
# 3 2010-09-12 06:00:00  24.000000   24   24      24
# 4 2010-09-12 16:00:00  34.000000   34   34      34
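
Note that percentile_approx returns an actual element of each group rather than interpolating, which is why the median for 2010-09-12 00:00:00 comes out as 12 here instead of the 12.5 shown in the desired output. It also takes an optional accuracy argument (default 10000; 1/accuracy is the relative error, so larger values are more precise at the cost of memory). A minimal sketch of that tweak, plus one way to get the ts-indexed layout from the question, assuming the same df and F as above:

# Larger accuracy gives a closer approximation but uses more memory.
df1 = df.groupby("ts").agg(
    F.avg("value").alias("average"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
    F.percentile_approx("value", 0.5, accuracy=100000).alias("median"),
)

# Sort before collecting and use ts as the index to match the desired layout.
result = df1.orderBy("ts").toPandas().set_index("ts")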

The following computes the exact median instead, but you shouldn't use an exact median on very large groups of data.

Also, you can filter the data without the datetime module.

from pyspark.sql import functions as F
df = (df
    .filter(F.col('ts').between('2010-02-01', '2022-06-01'))
    .groupBy('ts').agg(
        F.round(F.mean('value'), 1).alias('average'),
        F.min('value').alias('min'),
        F.max('value').alias('max'),
        F.expr('percentile(value, .5)').alias('median'),
    )
)
pdf = df.toPandas()
print(pdf)
#                    ts  average  min  max  median
# 0 2010-09-12 02:00:00     23.0   23   23    23.0
# 1 2010-09-12 00:00:00     12.5   12   13    12.5
# 2 2010-09-12 06:00:00     24.0   24   24    24.0
# 3 2010-09-12 16:00:00     34.0   34   34    34.0
# 4 2010-09-12 04:00:00     20.3    7   32    22.0
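
One caveat visible above: groupBy does not guarantee row order, so the aggregated rows come back shuffled. Sorting before toPandas restores the chronological layout from the question; Column.between also accepts Python datetime objects, so the question's start_ts/end_ts could be passed directly instead of date strings. A minimal sketch, assuming df is the aggregated frame from the snippet above:

# Sort chronologically before collecting, and index by ts as in the question.
pdf = df.orderBy('ts').toPandas().set_index('ts')
print(pdf)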
