
Create a KPI with a timestamp and a groupby in pyspark

I have a dataframe containing logs, just like this example:

+------------+--------------------------+--------------------+-------------------+
|Source      |Error                     |          @timestamp| timestamp_rounded |
+------------+--------------------------+--------------------+-------------------+
|      A     |             No           |2021-09-12T14:07:...|2021-09-12 16:10:00|
|      B     |             No           |2021-09-12T12:49:...|2021-09-12 14:50:00|
|      C     |             No           |2021-09-12T12:59:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:58:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T14:22:...|2021-09-12 16:20:00|
|      A     |             Yes          |2021-09-12T14:22:...|2021-09-12 16:25:00|
|      B     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T12:57:...|2021-09-12 14:55:00|
|      B     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T12:58:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:54:...|2021-09-12 14:55:00|
|      A     |             Yes          |2021-09-12T14:17:...|2021-09-12 16:15:00|
|      B     |             No           |2021-09-12T12:43:...|2021-09-12 14:45:00|
|      A     |             No           |2021-09-12T12:45:...|2021-09-12 14:45:00|
|      D     |             No           |2021-09-12T12:57:...|2021-09-12 14:55:00|
|      A     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:47:...|2021-09-12 14:45:00|
|      A     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|
|      A     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      A     |             No           |2021-09-12T14:23:...|2021-09-12 16:25:00|
+------------+--------------------------+--------------------+-------------------+
only showing top 20 rows

My dataframe has millions of logs, not that it matters.

I would like to calculate the error rate of every source, for every 5 minutes. I have searched for documentation on transformations like this one (groupby with partition? double groupby? ...) but I haven't found much information.

I can get a new column with Yes ==> 1 and No ==> 0 and then get the mean for every source with groupby and {avg: foo} to get the error rate for every source, but I want it for every 5 minutes (see column 'timestamp_rounded').
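For illustration, the per-source-only version I can already do looks roughly like this (an untested sketch, assuming my dataframe is called logs and using a helper column I named is_error):

from pyspark.sql import functions as F

per_source_rates = (
    logs
        # Yes -> 1, No -> 0 so that the mean of the flag is the error rate
        .withColumn("is_error", F.when(F.col("Error") == "Yes", 1).otherwise(0))
        .groupBy("Source")
        .agg(F.avg("is_error").alias("error_rate"))
)

This gives one error rate per source, but not per 5-minute bucket.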

The result I'm looking for would be like:

+-------------------+------------+--------------+-------------+------------+
|timestamp_rounded  |Error_rate_A| Error_rate_B | Error_rate_C|Error_rate_D|
+-------------------+------------+--------------+-------------+------------+
|2021-09-12 16:10:00|       0    |       0.2    |       0     |       0.2  |
|2021-09-12 16:15:00|       0.1  |       0.3    |       0     |       0    |
|2021-09-12 16:20:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:25:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:30:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:35:00|       0.2  |       0.2    |       0     |       0    |
|2021-09-12 16:40:00|       0.3  |       0.2    |       0     |       0.2  |
|2021-09-12 16:45:00|       0.4  |       0.3    |       0     |       0    |

etc...



Sources can be very numerous (my example has 4, but there can be thousands of sources).

Please tell me if you need more information. Thanks a lot!

Assuming your data is accessible in a dataframe named logs, you could achieve this with an initial group by on timestamp_rounded, then a pivot on source to transpose your aggregated error rates into rows with one column per source error rate for each timestamp_rounded. Finally, you may replace missing error rate values with 0.0.

Before performing these transformations, we can transform your Yes / No values to 1 / 0 to simplify the aggregation/mean, and rename the source column values with a prefix Error_rate_ to achieve the desired column names after the pivot.

NB. I changed one of your records in the sample data in the question

|      A     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|

to

|      A     |             Yes          |2021-09-12T12:57:...|2021-09-12 15:00:00|

to get more variation in your data. As a result, your dataframe would look like the output shown below after the aggregation.

You may achieve this using the following:

from pyspark.sql import functions as F

output_df = (
    logs.withColumn("Error", F.when(F.col("Error") == "Yes", 1).otherwise(0))  # Yes -> 1, No -> 0
        .withColumn("Source", F.concat(F.lit("Error_rate_"), F.col("Source")))  # prefix for the pivoted column names
        .groupBy("timestamp_rounded")
        .pivot("Source")
        .agg(
            F.round(F.mean("Error"), 2).alias("Error_rate")  # mean of the 0/1 flag = error rate, rounded
        )
        .na.fill(0.0)  # sources with no logs in a 5-minute bucket get 0.0
)

Outputs

+-------------------+------------+------------+------------+------------+
|timestamp_rounded  |Error_rate_A|Error_rate_B|Error_rate_C|Error_rate_D|
+-------------------+------------+------------+------------+------------+
|2021-09-12 14:50:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:15:00|1.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:20:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:25:00|0.5         |0.0         |0.0         |0.0         |
|2021-09-12 14:55:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 14:45:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:10:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 15:00:00|0.33        |0.0         |0.0         |0.0         |
+-------------------+------------+------------+------------+------------+

NB. The output above is not ordered; it can easily be ordered using .orderBy.
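For example, a minimal usage sketch on the output_df defined above:

output_df.orderBy("timestamp_rounded").show(truncate=False)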

Let me know if this works for you.
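One extra note, since you mention there can be thousands of sources: when pivot is called without an explicit list of values, Spark runs an extra job to discover the distinct pivot values, and that implicit discovery is capped by spark.sql.pivotMaxValues (10000 by default). If that becomes a concern, you can pass the values to pivot explicitly. A sketch under the same assumptions as above:

from pyspark.sql import functions as F

# Collect the (prefixed) source names once, then pivot over that explicit list.
source_values = [
    row["Source"]
    for row in logs.select(
        F.concat(F.lit("Error_rate_"), F.col("Source")).alias("Source")
    ).distinct().collect()
]

output_df = (
    logs.withColumn("Error", F.when(F.col("Error") == "Yes", 1).otherwise(0))
        .withColumn("Source", F.concat(F.lit("Error_rate_"), F.col("Source")))
        .groupBy("timestamp_rounded")
        .pivot("Source", source_values)
        .agg(F.round(F.mean("Error"), 2).alias("Error_rate"))
        .na.fill(0.0)
)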
