
Create a KPI with a timestamp and a groupby in pyspark

I have a dataframe containing logs just like this example:

+------------+--------------------------+--------------------+-------------------+
|Source      |Error                     |          @timestamp| timestamp_rounded |
+------------+--------------------------+--------------------+-------------------+
|      A     |             No           |2021-09-12T14:07:...|2021-09-12 16:10:00|
|      B     |             No           |2021-09-12T12:49:...|2021-09-12 14:50:00|
|      C     |             No           |2021-09-12T12:59:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:58:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T14:22:...|2021-09-12 16:20:00|
|      A     |             Yes          |2021-09-12T14:22:...|2021-09-12 16:25:00|
|      B     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T12:57:...|2021-09-12 14:55:00|
|      B     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|
|      B     |             No           |2021-09-12T12:58:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:54:...|2021-09-12 14:55:00|
|      A     |             Yes          |2021-09-12T14:17:...|2021-09-12 16:15:00|
|      B     |             No           |2021-09-12T12:43:...|2021-09-12 14:45:00|
|      A     |             No           |2021-09-12T12:45:...|2021-09-12 14:45:00|
|      D     |             No           |2021-09-12T12:57:...|2021-09-12 14:55:00|
|      A     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      C     |             No           |2021-09-12T12:47:...|2021-09-12 14:45:00|
|      A     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|
|      A     |             No           |2021-09-12T13:00:...|2021-09-12 15:00:00|
|      A     |             No           |2021-09-12T14:23:...|2021-09-12 16:25:00|
+------------+--------------------------+--------------------+-------------------+
only showing top 20 rows

My dataframe has millions of logs (not that it matters).

I would like to calculate the error rate of every source, for every 5 minutes. I have searched for documentation on transformations like this (groupby with partition? double groupby?...) but I haven't found much information.

I can get a new column with Yes ==> 1 and No ==> 0 and then take the mean per source with groupby and {avg: foo} to get the error rate for every source, but I want it broken down per 5 minutes (see col 'timestamp_rounded').
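For reference, a minimal sketch of that per-source-only approach (without the 5-minute buckets yet) might look like this, assuming the dataframe is named logs as below and using a hypothetical intermediate column is_error:

from pyspark.sql import functions as F

# Overall error rate per source, ignoring the 5-minute buckets for now.
# "is_error" is just an illustrative intermediate column name.
per_source_rate = (
    logs.withColumn("is_error", F.when(F.col("Error") == "Yes", 1).otherwise(0))
        .groupBy("Source")
        .agg(F.avg("is_error").alias("error_rate"))
)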

The result would be like:

+-------------------+------------+--------------+-------------+------------+
|timestamp_rounded  |Error_rate_A| Error_rate_B | Error_rate_C|Error_rate_D|
+-------------------+------------+--------------+-------------+------------+
|2021-09-12 16:10:00|       0    |       0.2    |       0     |       0.2  |
|2021-09-12 16:15:00|       0.1  |       0.3    |       0     |       0    |
|2021-09-12 16:20:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:25:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:30:00|       0    |       0.2    |       0     |       0    |
|2021-09-12 16:35:00|       0.2  |       0.2    |       0     |       0    |
|2021-09-12 16:40:00|       0.3  |       0.2    |       0     |       0.2  |
|2021-09-12 16:45:00|       0.4  |       0.3    |       0     |       0    |

etc...



Sources can be very numerous (my example has 4 but there can be thousands of sources)

Please tell me if you need more information. Thanks a lot !

Assuming your data is accessible in a dataframe named logs, you could achieve this with an initial groupBy on timestamp_rounded, then a pivot on Source to transpose the aggregated error rates into one column per source for each timestamp_rounded. Finally, you can replace missing error rate values with 0.0.

Before performing these transformations, we can convert your Yes / No values to 1 / 0 to simplify the mean aggregation, and prefix the Source column values with Error_rate_ so that the pivot produces the desired column names.

NB. I changed one of the records in the sample data in the question,

|      A     |             No           |2021-09-12T12:57:...|2021-09-12 15:00:00|

to

|      A     |             Yes           |2021-09-12T12:57:...|2021-09-12 15:00:00|

to introduce more variation in the data; the output shown further below reflects this change.

You may achieve this using the following:

from pyspark.sql import functions as F

output_df = (
    logs
        # Map Yes/No to 1/0 so the mean of the column gives the error rate
        .withColumn("Error", F.when(F.col("Error") == "Yes", 1).otherwise(0))
        # Prefix the source names so the pivoted columns become Error_rate_<Source>
        .withColumn("Source", F.concat(F.lit("Error_rate_"), F.col("Source")))
        # One row per 5-minute bucket, one column per source
        .groupBy("timestamp_rounded")
        .pivot("Source")
        .agg(
            F.round(F.mean("Error"), 2).alias("Error_rate")
        )
        # Sources with no logs in a bucket come out as null; fill them with 0.0
        .na.fill(0.0)
)

Outputs

+-------------------+------------+------------+------------+------------+
|timestamp_rounded  |Error_rate_A|Error_rate_B|Error_rate_C|Error_rate_D|
+-------------------+------------+------------+------------+------------+
|2021-09-12 14:50:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:15:00|1.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:20:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:25:00|0.5         |0.0         |0.0         |0.0         |
|2021-09-12 14:55:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 14:45:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 16:10:00|0.0         |0.0         |0.0         |0.0         |
|2021-09-12 15:00:00|0.33        |0.0         |0.0         |0.0         |
+-------------------+------------+------------+------------+------------+

NB. The output above is not ordered; it can easily be ordered using .orderBy.
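For example, to show the result in chronological order, something like this would work:

output_df.orderBy("timestamp_rounded").show(truncate=False)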

Let me know if this works for you.
