
Python: How to index every nth interval of a dataframe?

(complete code snippet at the end of the question)

I've got a pandas dataframe with unwanted values that occur at regular intervals, with a regular distance between them. How can I index them and remove them, or replace them with, for example, np.nan? The answers to the question "Python: how to remove/delete every n-th element from list?" show several ways to remove every n-th element from a list. But what if n is not a single integer, but an interval? I thought that, as with a = [1,2,3,4,5,6,7,8,9,10], a[k-1::k] might be a good starting point.

But if I've got a dataframe such as:

                  date  data
0  2020-01-01 00:00:00    66
1  2020-01-01 01:00:00    92
2  2020-01-01 02:00:00    98
3  2020-01-01 03:00:00    17
4  2020-01-01 04:00:00    83
5  2020-01-01 05:00:00    57
6  2020-01-01 06:00:00    86
7  2020-01-01 07:00:00    97
8  2020-01-01 08:00:00    96
9  2020-01-01 09:00:00    47
10 2020-01-01 10:00:00    73
11 2020-01-01 11:00:00    32
12 2020-01-01 12:00:00    46
13 2020-01-01 13:00:00    96
14 2020-01-01 14:00:00    25
15 2020-01-01 15:00:00    83
16 2020-01-01 16:00:00    78
17 2020-01-01 17:00:00    36
18 2020-01-01 18:00:00    96
19 2020-01-01 19:00:00    80
20 2020-01-01 20:00:00    68
21 2020-01-01 21:00:00    49
22 2020-01-01 22:00:00    55
23 2020-01-01 23:00:00    67

And run:

k1=10
df.iloc[k1-1::k1]=np.nan

k2=11
df.iloc[k2-1::k2]=np.nan

Then I get:

                  date  data
0  2020-01-01 00:00:00  66.0
1  2020-01-01 01:00:00  92.0
2  2020-01-01 02:00:00  98.0
3  2020-01-01 03:00:00  17.0
4  2020-01-01 04:00:00  83.0
5  2020-01-01 05:00:00  57.0
6  2020-01-01 06:00:00  86.0
7  2020-01-01 07:00:00  97.0
8  2020-01-01 08:00:00  96.0
9                  NaT   NaN
10                 NaT   NaN
11 2020-01-01 11:00:00  32.0
12 2020-01-01 12:00:00  46.0
13 2020-01-01 13:00:00  96.0
14 2020-01-01 14:00:00  25.0
15 2020-01-01 15:00:00  83.0
16 2020-01-01 16:00:00  78.0
17 2020-01-01 17:00:00  36.0
18 2020-01-01 18:00:00  96.0
19                 NaT   NaN
20 2020-01-01 20:00:00  68.0
21                 NaT   NaN
22 2020-01-01 22:00:00  55.0
23 2020-01-01 23:00:00  67.0

So for the first replacement, the values at index [9,10] are replaced with NaN. But for the second replacement I get this:

19                 NaT   NaN
20 2020-01-01 20:00:00  68.0
21                 NaT   NaN

How can I index the dataframe so that index 20 is assigned NaN and not index 21? And how can I make this work reliably for every tenth and eleventh value (so in this case the length of the interval is 2) if I've got a bigger dataframe?

Desired output:

                  date  data
0  2020-01-01 00:00:00  66.0
1  2020-01-01 01:00:00  92.0
2  2020-01-01 02:00:00  98.0
3  2020-01-01 03:00:00  17.0
4  2020-01-01 04:00:00  83.0
5  2020-01-01 05:00:00  57.0
6  2020-01-01 06:00:00  86.0
7  2020-01-01 07:00:00  97.0
8  2020-01-01 08:00:00  96.0
9  2020-01-01 09:00:00  47.0
10                 NaT   NaN
11                 NaT   NaN
12 2020-01-01 12:00:00  46.0
13 2020-01-01 13:00:00  96.0
14 2020-01-01 14:00:00  25.0
15 2020-01-01 15:00:00  83.0
16 2020-01-01 16:00:00  78.0
17 2020-01-01 17:00:00  36.0
18 2020-01-01 18:00:00  96.0
19 2020-01-01 19:00:00  80.0
20                 NaT   NaN
21                 NaT   NaN
22 2020-01-01 22:00:00  55.0
23 2020-01-01 23:00:00  67.0

Thank you for any suggestions!

Complete code:

# imports
import pandas as pd
import numpy as np

# data
np.random.seed(123)
dates = pd.date_range("2020.01.01", "2020.01.02", freq="1h")
dates=dates[:-1]
df = pd.DataFrame({'date':dates,
                   'data':np.random.randint(low=0, high=100, size=len(dates)).tolist()})

# indexing attempts
k1=10
df.iloc[k1-1::k1]=np.nan

k2=11
df.iloc[k2-1::k2]=np.nan

df

I think you need to start at position 10 with a step of 10 for the first replacement, and start at position 11 with the same step of 10 for the second:

k1=10
df.iloc[k1::10]=np.nan
k2=11
df.iloc[k2::10]=np.nan

Your second slice uses a step of 11, so it selects every 11th row; the output you got is the expected one.

EDIT: You can use modulo and integer division to set the 11th and 12th rows of the frame, and the same pair every 10 rows after that, while omitting the first and second rows:

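r = np.arange(len(df))
# True where the position within each block of 10 is 0 or 1, skipping the first block
# (np.isin replaces np.in1d, which is deprecated in recent NumPy releases)
mask = np.isin(r % 10, [0,1]) & (r // 10 > 0)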

df[mask] = np.nan
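The mask above hard-codes a block length of 10 and a run of 2 rows. As a rough sketch of how it might be generalised, the helper below takes both as parameters. The function name `interval_mask` and its arguments `period`, `width` and `skip_first` are made up for illustration and are not part of pandas or NumPy.

```python
import numpy as np

def interval_mask(n_rows, period=10, width=2, skip_first=True):
    """Boolean mask that is True for the first `width` positions of every
    block of `period` rows, optionally leaving the first block untouched."""
    r = np.arange(n_rows)
    mask = (r % period) < width          # position inside the block is 0..width-1
    if skip_first:
        mask &= (r // period) > 0        # drop the block that starts at row 0
    return mask

mask = interval_mask(48)
print(np.flatnonzero(mask))   # [10 11 20 21 30 31 40 41]
```

With a frame of matching length, the mask could then be applied the same way as above, e.g. `df[interval_mask(len(df))] = np.nan`.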

Sample :

np.random.seed(123)
dates = pd.date_range("2020.01.01", "2020.01.02", freq="30min")
dates=dates[:-1]
df = pd.DataFrame({'date':dates,
                   'data':np.random.randint(low=0, high=100, size=len(dates)).tolist()})


k1=10
df.iloc[k1::10]=np.nan
k2=11
df.iloc[k2::10]=np.nan

print (df)
                  date  data
0  2020-01-01 00:00:00  66.0
1  2020-01-01 00:30:00  92.0
2  2020-01-01 01:00:00  98.0
3  2020-01-01 01:30:00  17.0
4  2020-01-01 02:00:00  83.0
5  2020-01-01 02:30:00  57.0
6  2020-01-01 03:00:00  86.0
7  2020-01-01 03:30:00  97.0
8  2020-01-01 04:00:00  96.0
9  2020-01-01 04:30:00  47.0
10                 NaT   NaN
11                 NaT   NaN
12 2020-01-01 06:00:00  46.0
13 2020-01-01 06:30:00  96.0
14 2020-01-01 07:00:00  25.0
15 2020-01-01 07:30:00  83.0
16 2020-01-01 08:00:00  78.0
17 2020-01-01 08:30:00  36.0
18 2020-01-01 09:00:00  96.0
19 2020-01-01 09:30:00  80.0
20                 NaT   NaN
21                 NaT   NaN
22 2020-01-01 11:00:00  55.0
23 2020-01-01 11:30:00  67.0
24 2020-01-01 12:00:00   2.0
25 2020-01-01 12:30:00  84.0
26 2020-01-01 13:00:00  39.0
27 2020-01-01 13:30:00  66.0
28 2020-01-01 14:00:00  84.0
29 2020-01-01 14:30:00  47.0
30                 NaT   NaN
31                 NaT   NaN
32 2020-01-01 16:00:00   7.0
33 2020-01-01 16:30:00  99.0
34 2020-01-01 17:00:00  92.0
35 2020-01-01 17:30:00  52.0
36 2020-01-01 18:00:00  97.0
37 2020-01-01 18:30:00  85.0
38 2020-01-01 19:00:00  94.0
39 2020-01-01 19:30:00  27.0
40                 NaT   NaN
41                 NaT   NaN
42 2020-01-01 21:00:00  76.0
43 2020-01-01 21:30:00  40.0
44 2020-01-01 22:00:00   3.0
45 2020-01-01 22:30:00  69.0
46 2020-01-01 23:00:00  64.0
47 2020-01-01 23:30:00  75.0

So for the first replacement, the values at index [9,10] are replaced with NaN.

No. After

k1=10
df.iloc[k1-1::k1]=np.nan

the relevant indexes are list(range(k1-1, len(df), k1)), so 9 and 19.

And for k2=11, the indices are list(range(k2-1, len(df), k2)), so 10 and 21.
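To make the arithmetic concrete without building a frame, the positions can be listed with plain `range` calls. This is only an illustration; `n = 24` is assumed from the 24-row frame in the question.

```python
n = 24  # number of rows in the question's frame (assumed)

first = list(range(10 - 1, n, 10))    # positions hit by df.iloc[9::10]
second = list(range(11 - 1, n, 11))   # positions hit by df.iloc[10::11]
print(first, second)                  # [9, 19] [10, 21]

# Keeping the step at 10 and moving only the start keeps each pair adjacent.
fixed_a = list(range(9, n, 10))
fixed_b = list(range(10, n, 10))
print(fixed_a, fixed_b)               # [9, 19] [10, 20]
```

The two slices drift apart because the step differs (10 versus 11), not because the start differs.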

What you want is

k1=9
df.iloc[sum([[i, i+1] for i in range(k1, len(df), 10)], [])]

You can avoid the sum over lists with a nested list comprehension:

df.iloc[[i+j for i in range(k1, len(df), 10) for j in (0,1)]]
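One caveat with building position lists by hand: if a run starts on the last row, `i+1` falls past the end of the frame and `iloc` raises an IndexError for the out-of-bounds position. A possible NumPy variant that drops such positions is sketched below. The helper name `paired_positions` and its parameters are hypothetical, chosen only for this example.

```python
import numpy as np

def paired_positions(n_rows, start=9, step=10, width=2):
    """Positions start, start+1, ..., start+width-1, repeated every `step`
    rows, with anything past the last row dropped."""
    starts = np.arange(start, n_rows, step)
    # Broadcasting adds the offsets 0..width-1 to every run start.
    pos = (starts[:, None] + np.arange(width)).ravel()
    return pos[pos < n_rows]

print(paired_positions(24))   # [ 9 10 19 20]
print(paired_positions(20))   # [ 9 10 19]
```

The result can be passed straight to `iloc`, for example `df.iloc[paired_positions(len(df))] = np.nan`.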
