Delete non-consecutive values from a dataframe column

I have a dataframe like this:

Ind TIME  PREC  ET    PET   YIELD
0      1  1.21  0.02  0.02   0.00
1      2  0.00  0.03  0.04   0.00
2      3  0.00  0.03  0.05   0.00
3      4  0.00  0.04  0.05   0.00
4      5  0.00  0.05  0.07   0.00
5      6  0.00  0.03  0.05   0.00
6      7  0.00  0.02  0.04   0.00
7      8  1.14  0.03  0.04   0.00
8      9  0.10  0.02  0.03   0.00
9     10  0.00  0.03  0.04   0.00
10    11  0.10  0.05  0.11   0.00
11    12  0.00  0.06  0.15   0.00
12    13  2.30  0.14  0.44   0.00
13    14  0.17  0.09  0.29   0.00
14    15  0.00  0.13  0.35   0.00
15    16  0.00  0.14  0.39   0.00
16    17  0.00  0.10  0.31   0.00
17    18  0.00  0.15  0.51   0.00
18    19  0.00  0.22  0.58   0.00
19    20  0.10  0.04  0.09   0.00
20    21  0.00  0.04  0.06   0.00
21    22  0.27  0.13  0.43   0.00
22    23  0.00  0.10  0.25   0.00
23    24  0.00  0.03  0.04   0.00
24    25  0.00  0.04  0.05   0.00
25    26  0.43  0.04  0.15   0.00
26    27  0.17  0.06  0.23   0.00
27    28  0.50  0.02  0.04   0.00
28    29  0.00  0.03  0.04   0.00
29    30  0.00  0.04  0.08   0.00
30    31  0.00  0.04  0.08   0.00
31     1  6.48  1.97  5.10   0.03
32    32  0.00  0.22  0.70   0.00
33    33  0.00  0.49  0.88   0.00

In this dataframe, the 'TIME' column holds the ordinal day number in a year, but after the end of every month there is an extra row that holds the ordinal number of that month in the year. This messes up all dataframe calculations, so I would like to drop every row that contains a month value. First, I tried to use .shift():

df = df.loc[df.TIME == df.TIME.shift() + 1]

However, in this case I delete twice as many rows as I should. I also tried to delete every row that follows the end of a month:

for i in indexes:
    df = df.loc[df.index != i]

Here, indexes is a list of the row indexes that come right after the day value equals 31, 59, ..., 365, i.e. the end of every month. In a leap year, however, these values are different. I could create another list for leap years, but that would be very unpythonic. So I wonder: is there a better way to delete non-consecutive values from a dataframe (excluding the point where one year ends and the next one starts: 364, 365, 1, 2)?

EDIT: I should probably add that there are twenty years in this dataframe, so this is what the dataframe looks like at the end of each year:

     TIME   PREC    ET   PET  YIELD
370   360   0.00  0.14  0.26   0.04
371   361   0.00  0.15  0.27   0.04
372   362   0.00  0.14  0.25   0.04
373   363   0.11  0.18  0.32   0.04
374   364   0.00  0.15  0.25   0.04
375   365   0.00  0.17  0.29   0.04
376    12  16.29  4.44  7.74   1.89
377     1   0.00  0.16  0.28   0.03
378     2   0.00  0.18  0.32   0.03
379     3   0.00  0.22  0.40   0.03

df

    TIME   PREC    ET   PET  YIELD
0    360   0.00  0.14  0.26   0.04
1    361   0.00  0.15  0.27   0.04
2    362   0.00  0.14  0.25   0.04
3    363   0.11  0.18  0.32   0.04
4    364   0.00  0.15  0.25   0.04
5    365   0.00  0.17  0.29   0.04
6     12  16.29  4.44  7.74   1.89
7      1   1.21  0.02  0.02   0.00
8      2   0.00  0.03  0.04   0.00
9      3   0.00  0.03  0.05   0.00
10     4   0.00  0.04  0.05   0.00
11     5   0.00  0.05  0.07   0.00
12     6   0.00  0.03  0.05   0.00
13     7   0.00  0.02  0.04   0.00
14     8   1.14  0.03  0.04   0.00
15     9   0.10  0.02  0.03   0.00
16    10   0.00  0.03  0.04   0.00
17    11   0.10  0.05  0.11   0.00
18    12   0.00  0.06  0.15   0.00
19    13   2.30  0.14  0.44   0.00
20    14   0.17  0.09  0.29   0.00
21    15   0.00  0.13  0.35   0.00
22    16   0.00  0.14  0.39   0.00
23    17   0.00  0.10  0.31   0.00
24    18   0.00  0.15  0.51   0.00
25    19   0.00  0.22  0.58   0.00
26    20   0.10  0.04  0.09   0.00
27    21   0.00  0.04  0.06   0.00
28    22   0.27  0.13  0.43   0.00
29    23   0.00  0.10  0.25   0.00
30    24   0.00  0.03  0.04   0.00
31    25   0.00  0.04  0.05   0.00
32    26   0.43  0.04  0.15   0.00
33    27   0.17  0.06  0.23   0.00
34    28   0.50  0.02  0.04   0.00
35    29   0.00  0.03  0.04   0.00
36    30   0.00  0.04  0.08   0.00
37    31   0.00  0.04  0.08   0.00
38     1   6.48  1.97  5.10   0.03
39    32   0.00  0.22  0.70   0.00
40    33   0.00  0.49  0.88   0.00

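Look at the diffs in TIME. Drop the rows where the diff is -12 or lower. A month marker is at least 30 below the day number that precedes it, while day 1 following the month-12 marker has a diff of only -11, so this threshold removes the markers and keeps the start of each new year.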

df[~df.TIME.diff().le(-12)]

    TIME  PREC    ET   PET  YIELD
0    360  0.00  0.14  0.26   0.04
1    361  0.00  0.15  0.27   0.04
2    362  0.00  0.14  0.25   0.04
3    363  0.11  0.18  0.32   0.04
4    364  0.00  0.15  0.25   0.04
5    365  0.00  0.17  0.29   0.04
7      1  1.21  0.02  0.02   0.00
8      2  0.00  0.03  0.04   0.00
9      3  0.00  0.03  0.05   0.00
10     4  0.00  0.04  0.05   0.00
11     5  0.00  0.05  0.07   0.00
12     6  0.00  0.03  0.05   0.00
13     7  0.00  0.02  0.04   0.00
14     8  1.14  0.03  0.04   0.00
15     9  0.10  0.02  0.03   0.00
16    10  0.00  0.03  0.04   0.00
17    11  0.10  0.05  0.11   0.00
18    12  0.00  0.06  0.15   0.00
19    13  2.30  0.14  0.44   0.00
20    14  0.17  0.09  0.29   0.00
21    15  0.00  0.13  0.35   0.00
22    16  0.00  0.14  0.39   0.00
23    17  0.00  0.10  0.31   0.00
24    18  0.00  0.15  0.51   0.00
25    19  0.00  0.22  0.58   0.00
26    20  0.10  0.04  0.09   0.00
27    21  0.00  0.04  0.06   0.00
28    22  0.27  0.13  0.43   0.00
29    23  0.00  0.10  0.25   0.00
30    24  0.00  0.03  0.04   0.00
31    25  0.00  0.04  0.05   0.00
32    26  0.43  0.04  0.15   0.00
33    27  0.17  0.06  0.23   0.00
34    28  0.50  0.02  0.04   0.00
35    29  0.00  0.03  0.04   0.00
36    30  0.00  0.04  0.08   0.00
37    31  0.00  0.04  0.08   0.00
39    32  0.00  0.22  0.70   0.00
40    33  0.00  0.49  0.88   0.00
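
To see which diffs trigger the drop, here is a minimal sketch on a toy TIME column. The values are made up for illustration (the jump from 2 to 30 only keeps the example short), and the names step, mask and cleaned are not from the original answer.

```python
import pandas as pd

# Toy data: end of a year (364, 365), the month-12 marker, the start of
# the next year (1, 2), then the end of January (30, 31), the month-1
# marker, and the first days of February (32, 33).
df = pd.DataFrame({"TIME": [364, 365, 12, 1, 2, 30, 31, 1, 32, 33]})

# Difference to the previous row; the first row has no predecessor (NaN).
step = df["TIME"].diff()

# True where the value fell by 12 or more, i.e. on the month markers.
# NaN compares as False, so the first row is never flagged.
mask = step.le(-12)

cleaned = df[~mask]
print(cleaned["TIME"].tolist())  # [364, 365, 1, 2, 30, 31, 32, 33]
```

The original index labels are kept, so the dropped rows show up as gaps in the index (2 and 7 here).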

df[df['TIME'].shift().fillna(0) <= df['TIME']]

Gives what you're looking for. You were almost there with

df.loc[df.TIME == df.TIME.shift() +1]

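But the equality test also drops the row right after each month marker (for example 32 following 1), because it is not exactly one more than the previous value. That is why twice as many rows disappear. You don't need to get rid of rows where the shifted value is smaller, because that's just the first day of the next month.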
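
As a small illustration of the difference, the sketch below runs both tests on a made-up five-row TIME column. The names strict and relaxed are only for this example.

```python
import pandas as pd

# Days 30 and 31, the month-1 marker, then days 32 and 33.
df = pd.DataFrame({"TIME": [30, 31, 1, 32, 33]})

# Strict test: keep a row only if it is exactly previous + 1.
# This loses the first row (no predecessor), the marker, and day 32.
strict = df[df["TIME"] == df["TIME"].shift() + 1]

# Relaxed test: keep a row unless the previous value is larger.
# Only the marker is lost.
relaxed = df[df["TIME"].shift().fillna(0) <= df["TIME"]]

print(strict["TIME"].tolist())   # [31, 33]
print(relaxed["TIME"].tolist())  # [30, 31, 32, 33]
```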

The addition of .fillna(0) takes care of the NaN in the first row of df['TIME'].shift() .
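
A quick sketch of what happens without the fill, on a made-up three-element Series (the variable names are illustrative only):

```python
import pandas as pd

time = pd.Series([1, 2, 3], name="TIME")

# shift() moves every value down one row, leaving NaN in the first row.
prev = time.shift()

# Any comparison with NaN is False, so the first row would be dropped.
without_fill = prev <= time

# Filling with 0 makes the first comparison 0 <= 1, which is True.
with_fill = prev.fillna(0) <= time

print(without_fill.tolist())  # [False, True, True]
print(with_fill.tolist())     # [True, True, True]
```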

Edit:

For the end-of-year case, also keep rows whose value is up to 11 below the previous one. That catches day 1 when it follows the month-12 marker. That would give

df[(df['TIME'].shift().fillna(0) <= df['TIME']+11)]
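
To check the effect of the + 11, here is a sketch on a made-up year boundary; plain and adjusted are names chosen for this example.

```python
import pandas as pd

# End of one year, the month-12 marker, then the start of the next year.
df = pd.DataFrame({"TIME": [364, 365, 12, 1, 2]})
prev = df["TIME"].shift().fillna(0)

# Without the allowance, day 1 is dropped along with the marker,
# because the previous value (12) is larger than 1.
plain = df[prev <= df["TIME"]]

# With the allowance, 12 <= 1 + 11 holds, so day 1 survives.
adjusted = df[prev <= df["TIME"] + 11]

print(plain["TIME"].tolist())     # [364, 365, 2]
print(adjusted["TIME"].tolist())  # [364, 365, 1, 2]
```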

Edit2: By the by, I checked solution runtimes, and the current version of @piRSquared's answer ( df[~df.TIME.diff().le(-12)] ) seems to run fastest.

For completeness: comparing the solution in this post with the original version posted by @piRSquared, the former was a bit faster on datasets of about 10000 rows or fewer, and the latter was somewhat faster on larger ones.
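
If you want to repeat a comparison like this yourself, something along these lines should do. It is only a sketch: the synthetic data (twenty non-leap years with a month marker after each month) and the helper names make_time_column, by_diff and by_shift are assumptions for illustration, not the setup used for the timings above, and the numbers will vary with machine and pandas version.

```python
import timeit
import pandas as pd

MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def make_time_column(years):
    """Day-of-year values with a month marker after each month."""
    values = []
    for _ in range(years):
        day = 0
        for month, length in enumerate(MONTH_LENGTHS, start=1):
            values.extend(range(day + 1, day + length + 1))
            day += length
            values.append(month)  # the marker row that should be dropped
    return values

df = pd.DataFrame({"TIME": make_time_column(20)})

def by_diff(frame):
    return frame[~frame["TIME"].diff().le(-12)]

def by_shift(frame):
    return frame[frame["TIME"].shift().fillna(0) <= frame["TIME"] + 11]

# Both filters should select the same rows before timing them.
same_rows = by_diff(df).equals(by_shift(df))
print("same result:", same_rows)

for fn in (by_diff, by_shift):
    seconds = timeit.timeit(lambda: fn(df), number=100)
    print(fn.__name__, "seconds per call:", seconds / 100)
```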
