I have a dataframe that is sourced from a JMeter report. The index is a time series. There is a column called "success" that is true if the underlying sample was successful, and false otherwise.
There are two steps here. First, I need a running count of the number of rows where success is false. Then I need to divide that count by the number of samples so far.
I solved the first step by inverting the success column, converting it to int, and then running cumsum:
df['fail_count'] = (~df['success']).astype(int).cumsum()
Is there a cleaner way to solve the second piece, dividing by the number of samples, than adding a static column of ones, taking a cumsum over that column, and then doing the division?
df['fail_count'] = (~df['success']).astype(int).cumsum()
df['one'] = 1
df['sample_num'] = df['one'].cumsum()
df['error_rate'] = df['fail_count'].div(df['sample_num'])
Just perform vectorized division directly:
df['error_rate'] = df['fail_count'] / np.array(range(1, len(df) + 1))
Alternatively, if your index is the default [0, 1, 2, 3, ...], you can divide by the index directly. If it is not, calling df.reset_index(inplace=True) before the calculation will get the job done, but unnecessarily messing with the index is of course not recommended.
df['error_rate'] = df['fail_count'] / (df.index + 1)
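Since the index in the question is a time series, the positional variant is the one that applies without touching the index; df.index + 1 is only meaningful for the default integer index. A minimal sketch, assuming a small made-up frame with a DatetimeIndex (the timestamps and success values are illustrative, not taken from the JMeter report). Here np.arange(1, len(df) + 1) produces the same values as the np.array(range(...)) form above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: timestamps and outcomes are made up for illustration.
df = pd.DataFrame(
    {"success": [True, True, False, True, False, True]},
    index=pd.date_range("2021-01-01", periods=6, freq="min"),
)

# Running count of failed samples.
df["fail_count"] = (~df["success"]).cumsum()

# Divide by the 1-based row position; this ignores the index values entirely,
# so it works the same for a time-series index and a default integer index.
df["error_rate"] = df["fail_count"] / np.arange(1, len(df) + 1)

print(df)
```

The result keeps the original time-series index, so no reset_index call is needed.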
Code
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        "fail_count": [0, 0, 1, 1, 2, 2]
    }
)
df['error_rate'] = df['fail_count'] / np.array(range(1, len(df) + 1))
# or
# df['error_rate'] = df['fail_count'] / (df.index + 1)
Output
df
Out[15]:
   fail_count  error_rate
0           0    0.000000
1           0    0.000000
2           1    0.333333
3           1    0.250000
4           2    0.400000
5           2    0.333333
To compute fail_count use:
df['fail_count'] = (~df.success).cumsum()
This is almost the same as your code sample, but remember that bool is actually a subtype of int (True is 1 and False is 0), so you can compute a sum (including a cumulative one) directly from a bool column.
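A quick way to convince yourself that the cast is optional is to compare both forms on a small column. This is only a sketch: the success Series below is hypothetical and is assumed to have bool dtype, as the column in the question appears to.

```python
import pandas as pd

# Hypothetical boolean column standing in for df['success'].
success = pd.Series([True, False, False, True])

# In plain Python, bool is a subclass of int.
is_subclass = issubclass(bool, int)

# Running failure count with and without the explicit cast to int.
with_cast = (~success).astype(int).cumsum()
without_cast = (~success).cumsum()

# Both forms yield the same running counts.
same = with_cast.tolist() == without_cast.tolist()
print(is_subclass, same)
```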
And as far as error_rate is concerned, I see 2 simple solutions:
If you can rely on the index (a sequence of consecutive integers starting from 0), run:
df['error_rate'] = df.fail_count / (df.index + 1)
Otherwise, you can generate a temporary index of the proper size, starting from 1 (instead of 0) so you don't need "+ 1" later, and use it in just the same way:
df['error_rate'] = df.fail_count / pd.RangeIndex(1, df.index.size + 1)
Decide for yourself which variant to choose.
So it is enough to use just the 2 instructions above instead of your 4.
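For comparison, the running error rate is simply the mean of the failure indicator over all rows seen so far, so it can also be written as a single expanding mean. This is a different technique from the ones suggested above, sketched here on made-up data; the success values are hypothetical, and the cast to int is kept only to stay on the safe side with window operations.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean column standing in for df['success'].
success = pd.Series([True, True, False, True, False, True])

# 1 for a failed sample, 0 for a successful one.
failed = (~success).astype(int)

# Expanding mean: fraction of failures among all rows up to and including each row.
error_rate = failed.expanding().mean()

# The two-step version (cumulative count divided by 1-based position) for comparison.
two_step = failed.cumsum() / np.arange(1, len(failed) + 1)

print(np.allclose(error_rate, two_step))
```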