(complete code snippet at the end of the question)
I've got a pandas dataframe with unwanted values that occur at a regular interval, with a regular distance between them. How can I index them and remove them, or replace them with, for example, np.nan? The answers to the question "Python: how to remove/delete every n-th element from list?" show several ways to remove every n-th element from a list. But what if n is not a single integer but an interval? For a = [1,2,3,4,5,6,7,8,9,10], I thought that a[k-1::k] would be a good starting point.
But if I've got a dataframe such as:
date data
0 2020-01-01 00:00:00 66
1 2020-01-01 01:00:00 92
2 2020-01-01 02:00:00 98
3 2020-01-01 03:00:00 17
4 2020-01-01 04:00:00 83
5 2020-01-01 05:00:00 57
6 2020-01-01 06:00:00 86
7 2020-01-01 07:00:00 97
8 2020-01-01 08:00:00 96
9 2020-01-01 09:00:00 47
10 2020-01-01 10:00:00 73
11 2020-01-01 11:00:00 32
12 2020-01-01 12:00:00 46
13 2020-01-01 13:00:00 96
14 2020-01-01 14:00:00 25
15 2020-01-01 15:00:00 83
16 2020-01-01 16:00:00 78
17 2020-01-01 17:00:00 36
18 2020-01-01 18:00:00 96
19 2020-01-01 19:00:00 80
20 2020-01-01 20:00:00 68
21 2020-01-01 21:00:00 49
22 2020-01-01 22:00:00 55
23 2020-01-01 23:00:00 67
And run:
k1=10
df.iloc[k1-1::k1]=np.nan
k2=11
df.iloc[k2-1::k2]=np.nan
Then I get:
date data
0 2020-01-01 00:00:00 66.0
1 2020-01-01 01:00:00 92.0
2 2020-01-01 02:00:00 98.0
3 2020-01-01 03:00:00 17.0
4 2020-01-01 04:00:00 83.0
5 2020-01-01 05:00:00 57.0
6 2020-01-01 06:00:00 86.0
7 2020-01-01 07:00:00 97.0
8 2020-01-01 08:00:00 96.0
9 NaT NaN
10 NaT NaN
11 2020-01-01 11:00:00 32.0
12 2020-01-01 12:00:00 46.0
13 2020-01-01 13:00:00 96.0
14 2020-01-01 14:00:00 25.0
15 2020-01-01 15:00:00 83.0
16 2020-01-01 16:00:00 78.0
17 2020-01-01 17:00:00 36.0
18 2020-01-01 18:00:00 96.0
19 NaT NaN
20 2020-01-01 20:00:00 68.0
21 NaT NaN
22 2020-01-01 22:00:00 55.0
23 2020-01-01 23:00:00 67.0
So for the first replacement, the values at index [9,10] are replaced with NaN. But for the second replacement I get this:
19 NaT NaN
20 2020-01-01 20:00:00 68.0
21 NaT NaN
How can I index the dataframe so that index 20 is assigned NaN and not index 21? And how can I make this stable for every tenth and eleventh value (so in this case the length of the interval is 2) if I've got a bigger dataframe?
Desired output:
date data
0 2020-01-01 00:00:00 66.0
1 2020-01-01 01:00:00 92.0
2 2020-01-01 02:00:00 98.0
3 2020-01-01 03:00:00 17.0
4 2020-01-01 04:00:00 83.0
5 2020-01-01 05:00:00 57.0
6 2020-01-01 06:00:00 86.0
7 2020-01-01 07:00:00 97.0
8 2020-01-01 08:00:00 96.0
9 2020-01-01 09:00:00 47.0
10 NaT NaN
11 NaT NaN
12 2020-01-01 12:00:00 46.0
13 2020-01-01 13:00:00 96.0
14 2020-01-01 14:00:00 25.0
15 2020-01-01 15:00:00 83.0
16 2020-01-01 16:00:00 78.0
17 2020-01-01 17:00:00 36.0
18 2020-01-01 18:00:00 96.0
19 2020-01-01 19:00:00 80.0
20 NaT NaN
21 NaT NaN
22 2020-01-01 22:00:00 55.0
23 2020-01-01 23:00:00 67.0
Thank you for any suggestions!
Complete code:
# imports
import pandas as pd
import numpy as np
# data
np.random.seed(123)
dates = pd.date_range("2020.01.01", "2020.01.02", freq="1h")
dates=dates[:-1]
df = pd.DataFrame({'date':dates,
'data':np.random.randint(low=0, high=100, size=len(dates)).tolist()})
# indexing attempts
k1=10
df.iloc[k1-1::k1]=np.nan
k2=11
df.iloc[k2-1::k2]=np.nan
df
I think you need to start at position 10 with a step of 10 for the first replacement, and at position 11 with the same step of 10 for the second:
k1=10
df.iloc[k1::10]=np.nan
k2=11
df.iloc[k2::10]=np.nan
Your second replacement selects every 11th value (step 11), so the output you got is expected.
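To see the difference, it can help to list the positions each slice touches. A small check, assuming the 24-row frame from the question (the variable names here are only for illustration):

```python
n = 24  # number of rows in the question's frame

# positions touched by the original attempts (start k-1, step k)
orig_k1 = list(range(10 - 1, n, 10))
orig_k2 = list(range(11 - 1, n, 11))

# positions touched by the slices above (start k, fixed step of 10)
fixed_k1 = list(range(10, n, 10))
fixed_k2 = list(range(11, n, 10))

print(orig_k1, orig_k2)    # [9, 19] [10, 21]
print(fixed_k1, fixed_k2)  # [10, 20] [11, 21]
```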
EDIT: You can use modulo and integer division to set the 11th and 12th rows and then every 10th row after them (positions 10, 11, 20, 21, ...), while skipping the first and second rows:
r = np.arange(len(df))
mask = np.isin(r % 10, [0, 1]) & (r // 10 > 0)
df[mask] = np.nan
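The same idea can be wrapped in a small function so that the period and the width of the interval are parameters. This is only a sketch: `periodic_mask` and its arguments are hypothetical names, and the example frame uses a float column so that NaN can be stored without a dtype change.

```python
import numpy as np
import pandas as pd


def periodic_mask(n_rows, period=10, width=2, skip_first_block=True):
    """True for the first `width` positions of each block of `period` rows."""
    r = np.arange(n_rows)
    mask = (r % period) < width
    if skip_first_block:
        # leave the very first block (positions 0 .. width-1) untouched
        mask &= (r // period) > 0
    return mask


# float column so that NaN can be stored without changing the dtype
df = pd.DataFrame({"data": np.arange(24, dtype=float)})
mask = periodic_mask(len(df))
df.loc[mask, "data"] = np.nan

hit = np.flatnonzero(mask).tolist()
print(hit)  # [10, 11, 20, 21]
```

With width=2 and period=10 this should reproduce the mask above; other interval lengths only need a different `width`.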
Sample :
np.random.seed(123)
dates = pd.date_range("2020.01.01", "2020.01.02", freq="30min")
dates=dates[:-1]
df = pd.DataFrame({'date':dates,
'data':np.random.randint(low=0, high=100, size=len(dates)).tolist()})
k1=10
df.iloc[k1::10]=np.nan
k2=11
df.iloc[k2::10]=np.nan
print (df)
date data
0 2020-01-01 00:00:00 66.0
1 2020-01-01 00:30:00 92.0
2 2020-01-01 01:00:00 98.0
3 2020-01-01 01:30:00 17.0
4 2020-01-01 02:00:00 83.0
5 2020-01-01 02:30:00 57.0
6 2020-01-01 03:00:00 86.0
7 2020-01-01 03:30:00 97.0
8 2020-01-01 04:00:00 96.0
9 2020-01-01 04:30:00 47.0
10 NaT NaN
11 NaT NaN
12 2020-01-01 06:00:00 46.0
13 2020-01-01 06:30:00 96.0
14 2020-01-01 07:00:00 25.0
15 2020-01-01 07:30:00 83.0
16 2020-01-01 08:00:00 78.0
17 2020-01-01 08:30:00 36.0
18 2020-01-01 09:00:00 96.0
19 2020-01-01 09:30:00 80.0
20 NaT NaN
21 NaT NaN
22 2020-01-01 11:00:00 55.0
23 2020-01-01 11:30:00 67.0
24 2020-01-01 12:00:00 2.0
25 2020-01-01 12:30:00 84.0
26 2020-01-01 13:00:00 39.0
27 2020-01-01 13:30:00 66.0
28 2020-01-01 14:00:00 84.0
29 2020-01-01 14:30:00 47.0
30 NaT NaN
31 NaT NaN
32 2020-01-01 16:00:00 7.0
33 2020-01-01 16:30:00 99.0
34 2020-01-01 17:00:00 92.0
35 2020-01-01 17:30:00 52.0
36 2020-01-01 18:00:00 97.0
37 2020-01-01 18:30:00 85.0
38 2020-01-01 19:00:00 94.0
39 2020-01-01 19:30:00 27.0
40 NaT NaN
41 NaT NaN
42 2020-01-01 21:00:00 76.0
43 2020-01-01 21:30:00 40.0
44 2020-01-01 22:00:00 3.0
45 2020-01-01 22:30:00 69.0
46 2020-01-01 23:00:00 64.0
47 2020-01-01 23:30:00 75.0
"So for the first replacement, the values at index [9,10] are replaced with NaN."
No. After
k1=10
df.iloc[k1-1::k1]=np.nan
the relevant indices are list(range(k1-1, len(df), k1)), so 9, 19.
And for k2=11, the indices are list(range(k2-1, len(df), k2)), so 10, 21.
What you want is
# k1=9 selects positions 9, 10, 19, 20; use k1=10 for 10, 11, 20, 21 as in the desired output
k1=9
df.iloc[sum([[i, i+1] for i in range(k1, len(df), 10)], [])]
You can avoid the sum on lists with:
df.iloc[[i+j for i in range(k1, len(df), 10) for j in (0,1)]]
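One caveat with building the positions by hand: i+1 can run past the end of the frame when a block starts on the last row, and iloc raises an IndexError for out-of-bounds positions in a list. A possible sketch that builds the same positions with NumPy and drops the out-of-range ones (`block_positions` is a hypothetical helper, not part of pandas):

```python
import numpy as np


def block_positions(n_rows, start, step=10, width=2):
    """Positions start, start+1, ..., start+width-1, repeated every `step` rows."""
    starts = np.arange(start, n_rows, step)
    # add the offsets 0 .. width-1 to every block start, then flatten
    pos = (starts[:, None] + np.arange(width)).ravel()
    return pos[pos < n_rows]  # drop positions past the last row


print(block_positions(24, 9).tolist())  # [9, 10, 19, 20]
print(block_positions(20, 9).tolist())  # [9, 10, 19]
```

The returned array can be passed straight to df.iloc[...].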