
How to isolate periods with outliers in a dataframe

I have a dataframe that looks like this:

    id      epoch                      value    duration
958 1819    2018-01-01 00:00:00.000    1        20
959 1820    2018-01-01 00:20:00.000    2        20
960 1821    2018-01-01 00:40:00.000    3        20
961 1822    2018-01-01 01:00:00.000    4        20
962 1823    2018-01-01 01:20:00.000    5        20
963 1824    2018-01-01 01:20:01.000    5.05     0.01
964 1825    2018-01-01 01:40:01.000    6        20
965 1826    2018-01-01 02:00:01.000    7        20
966 1827    2018-01-01 02:00:02.000    7.0012   0.01
967 1828    2018-01-01 02:20:02.000    8        20

As you can see, the values repeat with a period of 3, and I want to number the periods in a new column while ignoring the 'outliers' that have a very short duration (but without removing those rows).

Here's what I have:

    id      epoch                      value    duration    period
958 1819    2018-01-01 00:00:00.000    1        20          1
959 1820    2018-01-01 00:20:00.000    2        20          2
960 1821    2018-01-01 00:40:00.000    3        20          3
961 1822    2018-01-01 01:00:00.000    4        20          1
962 1823    2018-01-01 01:20:00.000    5        20          2
963 1824    2018-01-01 01:20:01.000    5.05     0.01        3
964 1825    2018-01-01 01:40:01.000    6        20          1
965 1826    2018-01-01 02:00:01.000    7        20          2
966 1827    2018-01-01 02:00:02.000    7.0012   0.01        3
967 1828    2018-01-01 02:20:02.000    8        20          1

And here's what I want:

    id      epoch                      value    duration    period
958 1819    2018-01-01 00:00:00.000    1        20          1
959 1820    2018-01-01 00:20:00.000    2        20          2
960 1821    2018-01-01 00:40:00.000    3        20          3
961 1822    2018-01-01 01:00:00.000    4        20          1
962 1823    2018-01-01 01:20:00.000    5        20          2
963 1824    2018-01-01 01:20:01.000    5.05     0.01        2
964 1825    2018-01-01 01:40:01.000    6        20          3
965 1826    2018-01-01 02:00:01.000    7        20          1
966 1827    2018-01-01 02:00:02.000    7.0012   0.01        1
967 1828    2018-01-01 02:20:02.000    8        20          2

I have already done this with two for loops, but since the dataframe is large, I am looking for a faster way to do it.

Thanks in advance.

Edit: I added a few more lines. To be clearer: some points are "duplicated" (they have nearly the same value as the previous one), so I need to put each of them in the same period as the row it duplicates. Also, I can't remove them (except maybe temporarily); they must be present in the final dataframe.

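Given the data you provided and the expected output, a quick solution that is much faster than for loops is to use np.where(). The result is written to a new aux column so it can be compared with the original period:

import pandas as pd
import numpy as np

d = {'value': [1, 2, 3, 4, 5, 5.05, 6], 'dur': [20, 20, 20, 20, 20, 0.01, 20], 'period': [1, 2, 3, 1, 2, 3, 1]}
df = pd.DataFrame(d)
# Rows with a whole-number duration keep their period; the short outlier rows
# (fractional duration) take the previous period number instead.
df['aux'] = np.where(df['dur'] - df['dur'].astype(int) == 0, df['period'], df['period'] - 1)
print(df)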

Output:

   value    dur  period  aux
0   1.00  20.00       1    1
1   2.00  20.00       2    2
2   3.00  20.00       3    3
3   4.00  20.00       1    1
4   5.00  20.00       2    2
5   5.05   0.01       3    2
6   6.00  20.00       1    1
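
The np.where() approach above corrects the outlier row itself, but the rows after it keep their old numbering (row 6 still gets 1, whereas the question expects 3). One possible way to shift the later rows as well, without loops, is to count only the regular rows with a cumulative sum and derive the period from that count. This is a minimal sketch, not a tested solution for the full dataset: the threshold MIN_DURATION and the name CYCLE_LENGTH are assumptions introduced for illustration, and it assumes the first row is a regular one.

```python
import pandas as pd

# Sample data copied from the question (only the columns needed here).
df = pd.DataFrame({
    "value": [1, 2, 3, 4, 5, 5.05, 6, 7, 7.0012, 8],
    "duration": [20, 20, 20, 20, 20, 0.01, 20, 20, 0.01, 20],
})

MIN_DURATION = 1   # assumed cutoff: anything shorter is treated as an outlier
CYCLE_LENGTH = 3   # the values repeat with a period of 3

# True for regular rows, False for the short-duration outliers.
is_regular = df["duration"] >= MIN_DURATION

# Running count of regular rows; an outlier does not increase the count,
# so it shares the count (and therefore the period) of the row before it.
regular_count = is_regular.cumsum()

# Map the running count 1, 2, 3, 4, ... onto 1, 2, 3, 1, ...
df["period"] = (regular_count - 1) % CYCLE_LENGTH + 1

print(df)
```

For the ten sample rows this gives the periods 1, 2, 3, 1, 2, 2, 3, 1, 1, 2, which matches the desired column in the question. The outlier rows stay in the dataframe; they are only excluded from the count.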
