I have a dataframe that looks like this:
     id    epoch                    value   duration
958  1819  2018-01-01 00:00:00.000  1       20
959  1820  2018-01-01 00:20:00.000  2       20
960  1821  2018-01-01 00:40:00.000  3       20
961  1822  2018-01-01 01:00:00.000  4       20
962  1823  2018-01-01 01:20:00.000  5       20
963  1824  2018-01-01 01:20:01.000  5.05    0.01
964  1825  2018-01-01 01:40:01.000  6       20
965  1826  2018-01-01 02:00:01.000  7       20
966  1827  2018-01-01 02:00:02.000  7.0012  0.01
967  1828  2018-01-01 02:20:02.000  8       20
As you can see, the values follow a cycle of 3 periods, and I want to number the periods in a new column while ignoring the 'outliers' that have a very short duration (but without removing those rows).
Here's what I have:
     id    epoch                    value   duration  period
958  1819  2018-01-01 00:00:00.000  1       20        1
959  1820  2018-01-01 00:20:00.000  2       20        2
960  1821  2018-01-01 00:40:00.000  3       20        3
961  1822  2018-01-01 01:00:00.000  4       20        1
962  1823  2018-01-01 01:20:00.000  5       20        2
963  1824  2018-01-01 01:20:01.000  5.05    0.01      3
964  1825  2018-01-01 01:40:00.000  6       20        1
965  1826  2018-01-01 02:00:01.000  7       20        2
966  1827  2018-01-01 02:00:02.000  7.0012  0.01      3
967  1828  2018-01-01 02:20:02.000  8       20        1
And here's what I want:
     id    epoch                    value   duration  period
958  1819  2018-01-01 00:00:00.000  1       20        1
959  1820  2018-01-01 00:20:00.000  2       20        2
960  1821  2018-01-01 00:40:00.000  3       20        3
961  1822  2018-01-01 01:00:00.000  4       20        1
962  1823  2018-01-01 01:20:00.000  5       20        2
963  1824  2018-01-01 01:20:01.000  5.05    0.01      2
964  1825  2018-01-01 01:40:00.000  6       20        3
965  1826  2018-01-01 02:00:01.000  7       20        1
966  1827  2018-01-01 02:00:02.000  7.0012  0.01      1
967  1828  2018-01-01 02:20:02.000  8       20        2
I have already done this with two for loops, but since the dataframe is large, I am looking for a faster way to do it.
Thanks in advance.
Edit: I added a few more rows. To be clearer: some points are "duplicated" (they have nearly the same value as the previous one), so I need to put each of them in the same period as the row it duplicates. Also, I can't remove them (except maybe temporarily); they have to be in the final dataframe.
A quick solution, given the data you provide and the expected output, and much faster than using a for loop, is to use np.where():
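import pandas as pd
import numpy as np
# Sample data: 'dur' is the duration, 'period' is the naive 1-2-3 numbering
d = {'value':[1,2,3,4,5,5.05,6],'dur':[20,20,20,20,20,0.01,20],'period':[1,2,3,1,2,3,1]}
df = pd.DataFrame(d)
# Rows with a whole-number duration keep their period; rows with a fractional
# duration (the short outliers) take the previous period number instead.
# The result goes into a new column 'aux' so the original 'period' is preserved.
df['aux'] = np.where(df['dur'] - df['dur'].astype(int) == 0, df['period'], df['period'] - 1)
print(df)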
Output:
   value    dur  period  aux
0   1.00  20.00       1    1
1   2.00  20.00       2    2
2   3.00  20.00       3    3
3   4.00  20.00       1    1
4   5.00  20.00       2    2
5   5.05   0.01       3    2
6   6.00  20.00       1    1
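Note that in the output above, the row after the outlier (index 6) still gets 1, whereas the question's expected output continues the cycle from the outlier's period (3). A possible vectorized alternative, swapping np.where() for a cumulative sum, is sketched below. It is only a sketch under assumptions: the cutoff MIN_DURATION is a made-up threshold for "very short duration", the column is named duration as in the question, and the first row is assumed not to be an outlier.

```python
import pandas as pd

# Durations from the question's sample; 0.01 marks the short "duplicate" rows.
df = pd.DataFrame({
    "value": [1, 2, 3, 4, 5, 5.05, 6, 7, 7.0012, 8],
    "duration": [20, 20, 20, 20, 20, 0.01, 20, 20, 0.01, 20],
})

N_PERIODS = 3        # cycle length, taken from the question
MIN_DURATION = 1.0   # assumed cutoff; anything shorter is treated as an outlier

# True for regular rows, False for short-duration outliers.
is_regular = df["duration"] >= MIN_DURATION

# The running count only advances on regular rows, so an outlier
# keeps the count (and therefore the period) of the row before it.
step = is_regular.cumsum()

# Wrap the running count into 1..N_PERIODS.
df["period"] = (step - 1) % N_PERIODS + 1
print(df)
```

Because the outlier rows stay in the dataframe and simply do not advance the counter, nothing has to be removed and re-inserted, and there is no Python-level loop.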