I have a dataframe of shape [600 000, 19]. I want to filter the first 100 000 rows based on one condition, the next 300 000 based on another condition, and a 3rd condition for the last rows. I was wondering how this can be done.
Currently, I split the data frame into 3 segments and apply their respective conditions. Then, I re-concatenate the data frame. Is there a better way?
Example: filter the first 100 000 rows to values less than 5; for the next 300 000 rows, I don't want any values greater than 40; and so on.
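For reference, the split-and-concat approach described in the question looks roughly like this (a small sketch with an illustrative column `x` and made-up thresholds, not the real 600 000 × 19 frame):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 600 000-row frame
df = pd.DataFrame({'x': np.arange(12)})

# Split into three positional segments, filter each, then re-concatenate
first = df.iloc[:4]
second = df.iloc[4:8]
third = df.iloc[8:]

filtered = pd.concat([
    first[first['x'] < 5],        # condition for the first segment
    second[second['x'] < 40],     # condition for the second segment
    third[third['x'] % 2 == 0],   # condition for the last segment
])
```

This works, but it materializes three intermediate frames; the answers below avoid the explicit split.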
You can try the following approach:
import numpy as np
import pandas as pd
sample = pd.DataFrame({'x': np.arange(100),
                       'colname': np.arange(100)})
conditions = [('index < 5', 'colname < 3'),
('index > 50', 'index < 100', 'colname < 55')]
sample.query('|'.join(map(lambda x: '&'.join(x), conditions)))
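The same idea can also be written without query strings, by combining positional range masks with the per-segment conditions (a sketch equivalent to the query above, not taken from it):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'x': np.arange(100),
                       'colname': np.arange(100)})

pos = np.arange(len(sample))  # positional row numbers 0..99

# Each segment's range mask is AND-ed with its condition,
# then the segments are OR-ed together into one mask.
mask = ((pos < 5) & (sample['colname'] < 3)) | \
       ((pos > 50) & (pos < 100) & (sample['colname'] < 55))
result = sample[mask]
```

One combined boolean mask means a single pass over the frame and no string parsing.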
One approach would be to use DataFrame index slicing with pd.concat
to build a complete boolean mask:
import numpy as np
import pandas as pd
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,50,60))
df[pd.concat([df.iloc[:10] > 10, df[11:40] < 30, df[41:] % 2 == 0])]
Here the first 10 rows keep values greater than 10, rows 11 through 39 keep values less than 30, and rows from 41 onward keep even numbers; rows 10 and 40 fall outside all three slices, so they come back as NaN along with the rows that fail their condition.
Then you can chain .dropna() to remove all the NaN rows.
Output:
0
0 44.0
1 47.0
2 NaN
3 NaN
4 NaN
5 39.0
6 NaN
7 19.0
8 21.0
9 36.0
10 NaN
11 6.0
12 24.0
13 24.0
14 12.0
15 1.0
16 NaN
17 NaN
18 23.0
19 NaN
20 24.0
21 17.0
22 NaN
23 25.0
24 13.0
25 8.0
26 9.0
27 20.0
28 16.0
29 5.0
30 15.0
31 NaN
32 0.0
33 18.0
34 NaN
35 24.0
36 NaN
37 29.0
38 19.0
39 19.0
40 NaN
41 NaN
42 32.0
43 NaN
44 NaN
45 32.0
46 NaN
47 10.0
48 NaN
49 NaN
50 NaN
51 28.0
52 34.0
53 0.0
54 0.0
55 36.0
56 NaN
57 38.0
58 40.0
59 NaN
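Putting the two steps together, chaining .dropna() onto the masked frame leaves only the rows that satisfy their segment's condition; a minimal sketch reusing the answer's data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 50, 60))

# Per-segment boolean frames concatenated into one mask;
# rows absent from every slice align as NaN and are treated as False
mask = pd.concat([df.iloc[:10] > 10, df[11:40] < 30, df[41:] % 2 == 0])
result = df[mask].dropna()
```

Every row that survives satisfies the condition of the segment it came from.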