
Pandas: Clean way to calculate running error based on boolean column of a dataframe?

I have a dataframe that is sourced from a JMeter report. The index is a time series. There is a column called "success" that is true if the underlying sample was successful, and false otherwise.


There are two steps here. First, I need a running count of the rows where success is false. Then I need to divide that count by the number of samples so far.

I solved the first step by inverting the success column, converting it to int, and then running cumsum:

df['fail_count'] = (~df['success']).astype(int).cumsum()

Is there a cleaner way to solve the second piece, dividing by the number of samples, than adding a static column of ones, taking a cumsum over that column, and then doing the division?

    df['fail_count'] = (~df['success']).astype(int).cumsum()
    df['one'] = 1
    df['sample_num'] = df['one'].cumsum()
    df['error_rate'] = df['fail_count'].div(df['sample_num'])

Just perform vectorized division directly:

df['error_rate'] = df['fail_count'] / np.array(range(1, len(df) + 1))

Alternatively, if your index is the default [0, 1, 2, 3, ...], you can do the following. If it is not, calling df.reset_index(inplace=True) before the calculation will get the job done, but unnecessarily messing with the index is of course not recommended.

df['error_rate'] = df['fail_count'] / (df.index + 1)
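Since the index in the question is a time series, here is a minimal sketch of the position-based division that leaves the index untouched. The timestamps and success values are made-up sample data, and np.arange is used here as a shorthand for np.array(range(...)).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a time-series index like the one described in the question.
idx = pd.date_range("2024-01-01", periods=6, freq="min")
df = pd.DataFrame({"success": [True, True, False, True, False, True]}, index=idx)

# Running count of failed samples.
df["fail_count"] = (~df["success"]).cumsum()

# 1-based sample number per row, built from the row position rather than the index values.
sample_num = np.arange(1, len(df) + 1)
df["error_rate"] = df["fail_count"] / sample_num

print(df)
```

The division is positional because the right-hand side is a plain NumPy array, so the DatetimeIndex never takes part in the arithmetic.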

Code

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        "fail_count": [0,0,1,1,2,2]
    }
)

df['error_rate'] = df['fail_count'] / np.array(range(1, len(df) + 1))
# or
# df['error_rate'] = df['fail_count'] / (df.index + 1)

Output

df
Out[15]: 
   fail_count  error_rate
0           0    0.000000
1           0    0.000000
2           1    0.333333
3           1    0.250000
4           2    0.400000
5           2    0.333333

To compute fail_count use:

df['fail_count'] = (~df.success).cumsum()

This is almost the same as your code sample, but remember that bool is actually a subtype of int (True is 1 and False is 0), so you can compute the sum (including a cumulative one) directly from a bool column.
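A quick way to check this claim is to compare the running count with and without the cast. This is only an illustrative sketch; the success values are made up.

```python
import pandas as pd

# Hypothetical sample values for the "success" column.
success = pd.Series([True, True, False, True, False, True])

with_cast = (~success).astype(int).cumsum()   # the question's version
without_cast = (~success).cumsum()            # cumsum directly on the bool column

print(without_cast.tolist())  # [0, 0, 1, 1, 2, 2]
print(with_cast.tolist() == without_cast.tolist())  # True
```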

And as far as error_rate is concerned, I see 2 simple solutions:

  1. If you can rely on the index (a sequence of numbers from 0), run:

     df['error_rate'] = df.fail_count / (df.index + 1)
  2. Otherwise, generate a temporary index of the proper size, starting from 1 (instead of 0) so that you don't need "+ 1" later, and use it in the same way:

     df['error_rate'] = df.fail_count / pd.RangeIndex(1, df.index.size + 1)

Decide yourself which variant to choose.

So it is enough to use just the two instructions above instead of your four.
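As a cross-check on the two-instruction version, the running error rate is also the expanding mean of the failure indicator. The sketch below compares the two on made-up data; expanding().mean() is a different pandas technique, not one proposed in the answers above, and the cast to float is a precaution rather than a stated requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the default 0-based index.
df = pd.DataFrame({"success": [True, True, False, True, False, True]})

fails = ~df["success"]

# The two-instruction approach: running failure count divided by a 1-based counter.
fail_count = fails.cumsum()
two_step = fail_count / pd.RangeIndex(1, df.index.size + 1)

# Alternative: the mean of the failure indicator over all rows seen so far.
expanding_mean = fails.astype(float).expanding().mean()

print(np.allclose(two_step, expanding_mean))  # True
```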

