I will try my best to explain what I need help with. I have the following df (thousands if not millions of rows) with a datetime index like the sample below:
INDEX COL A COL B
2018-05-07 21:53:13.731 0.365127 9391.800000
2018-05-07 21:53:16.201 0.666127 9391.800000
2018-05-07 21:53:18.038 0.143104 9391.800000
2018-05-07 21:53:18.243 0.025643 9391.800000
2018-05-07 21:53:18.265 0.640484 9391.800000
2018-05-07 21:53:18.906 -0.100000 9391.793421
2018-05-07 21:53:19.829 0.559516 9391.800000
2018-05-07 21:53:19.846 0.100000 9391.800000
2018-05-07 21:53:19.870 0.006560 9391.800000
2018-05-07 21:53:20.734 0.666076 9391.800000
2018-05-07 21:53:20.775 0.666076 9391.800000
2018-05-07 21:53:28.607 0.100000 9391.800000
2018-05-07 21:53:28.610 0.041991 9391.800000
2018-05-07 21:53:29.283 -0.053518 9391.793421
2018-05-07 21:53:47.322 -0.046302 9391.793421
2018-05-07 21:53:49.182 0.100000 9391.800000
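For reference, a few rows of a frame like this can be rebuilt with a DatetimeIndex (values copied from the table above); the datetime index is what makes the resampling below possible. This is just a reproducibility sketch, not code from the question:

```python
import pandas as pd

# Reconstruct a small slice of the sample frame with a DatetimeIndex.
# Column names 'COL A' / 'COL B' are taken from the table above.
df = pd.DataFrame(
    {
        "COL A": [0.365127, 0.666127, -0.100000, 0.559516],
        "COL B": [9391.800000, 9391.800000, 9391.793421, 9391.800000],
    },
    index=pd.to_datetime(
        [
            "2018-05-07 21:53:13.731",
            "2018-05-07 21:53:16.201",
            "2018-05-07 21:53:18.906",
            "2018-05-07 21:53:19.829",
        ]
    ),
)
df.index.name = "INDEX"
print(df)
```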
What I would like to do is group the rows into 5-second intervals and perform (sometimes complex) calculations on each 5-second interval/subset.
Let's say, for example, I want to calculate the percentage of positive vs. negative values in column A within each 5-second block.
The interval 2018-05-07 21:53:10 to 2018-05-07 21:53:15 contains only one row, and its column A value is positive, so I would create a new column C with 100%.
Similarly, 2018-05-07 21:53:15 to 2018-05-07 21:53:20 has 8 rows in column A, 7 of which are positive and 1 of which is negative, so column C would be 87.5%.
I would post sample code, but I'm really unsure of the best way to do this. A sample output (new df) might look like the below, with COL D being simply the minimum value in COL B for that 5-second grouping:
INDEX COL C COL D (MIN)
2018-05-07 21:53:10 100% 9391.800000
2018-05-07 21:53:15 87.5% 9391.793421
2018-05-07 21:53:20 100% 9391.800000
2018-05-07 21:53:25 66.7% 9391.793421
2018-05-07 21:53:30 nan nan
2018-05-07 21:53:35 nan nan
2018-05-07 21:53:40 nan nan
2018-05-07 21:53:45 50% 9391.793421
Please keep in mind that I want to run many different calculations over each grouping, so the built-in .sum(), .mean(), .agg() etc. will not suffice for more complex calculations.
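For the more complex, multi-value calculations described above, one general pattern (a sketch, not from the question; `block_stats`, `pct_positive` and `min_b` are illustrative names of my own) is to group with pd.Grouper and apply a custom function that receives each 5-second sub-DataFrame and returns a Series. Lowercase "5s" is used here because the uppercase 'S' frequency alias is deprecated in recent pandas:

```python
import pandas as pd

# Sketch: run an arbitrary function over each 5-second block.
# The function receives the whole sub-DataFrame for one interval,
# so any calculation over the block is possible.
def block_stats(block: pd.DataFrame) -> pd.Series:
    a = block["COL A"]
    return pd.Series(
        {
            # percentage of positive values in COL A within this block
            "pct_positive": (a > 0).mean() * 100 if len(a) else float("nan"),
            # minimum of COL B within this block
            "min_b": block["COL B"].min(),
        }
    )

# Small reproducible frame standing in for the real data.
df = pd.DataFrame(
    {"COL A": [0.5, -0.1, 0.2], "COL B": [9391.8, 9391.79, 9391.8]},
    index=pd.to_datetime(
        ["2018-05-07 21:53:13", "2018-05-07 21:53:16", "2018-05-07 21:53:18"]
    ),
)

out = df.groupby(pd.Grouper(freq="5s")).apply(block_stats)
print(out)
```

Any number of derived columns can be returned from the one function, which avoids repeating the resample for each statistic.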
Appreciate any help and am happy to clarify the question if needed.
I believe that for the percentage of positive values you need the mean of the boolean mask values > 0:
df = df.resample('5S').agg({'COL A': lambda x: (x > 0).mean() * 100, 'COL B': 'min'})
print(df)
COL A COL B
INDEX
2018-05-07 21:53:10 100.000000 9391.800000
2018-05-07 21:53:15 87.500000 9391.793421
2018-05-07 21:53:20 100.000000 9391.800000
2018-05-07 21:53:25 66.666667 9391.793421
2018-05-07 21:53:30 NaN NaN
2018-05-07 21:53:35 NaN NaN
2018-05-07 21:53:40 NaN NaN
2018-05-07 21:53:45 50.000000 9391.793421
and for the percentage of negative values, the mean of the mask values < 0:
df = df.resample('5S').agg({'COL A': lambda x: (x < 0).mean() * 100, 'COL B': 'min'})
print(df)
COL A COL B
INDEX
2018-05-07 21:53:10 0.000000 9391.800000
2018-05-07 21:53:15 12.500000 9391.793421
2018-05-07 21:53:20 0.000000 9391.800000
2018-05-07 21:53:25 33.333333 9391.793421
2018-05-07 21:53:30 NaN NaN
2018-05-07 21:53:35 NaN NaN
2018-05-07 21:53:40 NaN NaN
2018-05-07 21:53:45 50.000000 9391.793421
As @Alexander pointed out, 0 is neither positive nor negative, so it is best to exclude zeros before counting:
df = df.resample('5S').agg({'COL A': lambda x: (x[x.ne(0)] > 0).mean() * 100, 'COL B': 'min'})
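Building on that, both percentages (and the COL B minimum) can be produced in a single pass while excluding zeros, by returning a Series per group. A sketch with made-up names (`signed_pcts`, `pct_pos`, `pct_neg`, `min_b`), using lowercase "5s" since the uppercase alias is deprecated in recent pandas:

```python
import pandas as pd

# Sketch: report both signed percentages at once, ignoring exact zeros.
def signed_pcts(block: pd.DataFrame) -> pd.Series:
    a = block["COL A"]
    nz = a[a != 0]  # drop exact zeros, as suggested above
    return pd.Series(
        {
            "pct_pos": (nz > 0).mean() * 100,
            "pct_neg": (nz < 0).mean() * 100,
            "min_b": block["COL B"].min(),
        }
    )

# Small stand-in frame; one 5-second bin containing 0.1, -0.2 and an exact 0.
df = pd.DataFrame(
    {"COL A": [0.1, -0.2, 0.0], "COL B": [9391.8, 9391.79, 9391.8]},
    index=pd.to_datetime(
        ["2018-05-07 21:53:11", "2018-05-07 21:53:12", "2018-05-07 21:53:13"]
    ),
)

out = df.groupby(pd.Grouper(freq="5s")).apply(signed_pcts)
print(out)
```

With the zero excluded, the two percentages sum to 100% within each non-empty bin.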