I have a dataframe of shape [600 000, 19]. I want to filter the first 100 000 rows based on one condition, the next 300 000 based on another condition, and a 3rd condition for the last rows. I was wondering how this can be done.
Currently, I split the data frame into 3 segments and apply their respective conditions. Then, I re-concatenate the data frame. Is there a better way?
Example: filter the first 100 000 rows to values less than 5; for the next 300 000 rows, I don't want any values greater than 40; and so on.
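For reference, the split-and-concat approach described in the question looks roughly like this (a small sketch with an illustrative column `x` and made-up thresholds, not the real 600 000 × 19 frame):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 600 000-row frame
df = pd.DataFrame({'x': np.arange(12)})

# Split into three positional segments, filter each, then re-concatenate
first = df.iloc[:4]
second = df.iloc[4:8]
third = df.iloc[8:]

filtered = pd.concat([
    first[first['x'] < 5],        # condition for the first segment
    second[second['x'] < 40],     # condition for the second segment
    third[third['x'] % 2 == 0],   # condition for the last segment
])
```

This works, but it materializes three intermediate frames; the answers below avoid the explicit split.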
You can try the following approach:
import numpy as np
import pandas as pd
sample = pd.DataFrame({'x': np.arange(100),
                       'colname': np.arange(100)})
conditions = [('index < 5', 'colname < 3'),
('index > 50', 'index < 100', 'colname < 55')]
sample.query('|'.join(map(lambda x: '&'.join(x), conditions)))
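The same idea can also be written without query strings, by combining positional range masks with the per-segment conditions (a sketch equivalent to the query above, not taken from it):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'x': np.arange(100),
                       'colname': np.arange(100)})

pos = np.arange(len(sample))  # positional row numbers 0..99

# Each segment's range mask is AND-ed with its condition,
# then the segments are OR-ed together into one mask.
mask = ((pos < 5) & (sample['colname'] < 3)) | \
       ((pos > 50) & (pos < 100) & (sample['colname'] < 55))
result = sample[mask]
```

One combined boolean mask means a single pass over the frame and no string parsing.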
One approach would be to use DataFrame index slicing with pd.concat
to build a complete boolean mask:
import numpy as np
import pandas as pd
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,50,60))
df[pd.concat([df.iloc[:10] > 10, df[11:40] < 30, df[41:] % 2 == 0])]
Here the first 10 rows keep values greater than 10, rows 11 through 39 keep values less than 30, and rows from 41 onward keep even numbers; rows 10 and 40 fall outside all three slices, so they come back as NaN along with the rows that fail their condition.
Then you can chain .dropna() to remove all the NaN rows.
Output:
0
0 44.0
1 47.0
2 NaN
3 NaN
4 NaN
5 39.0
6 NaN
7 19.0
8 21.0
9 36.0
10 NaN
11 6.0
12 24.0
13 24.0
14 12.0
15 1.0
16 NaN
17 NaN
18 23.0
19 NaN
20 24.0
21 17.0
22 NaN
23 25.0
24 13.0
25 8.0
26 9.0
27 20.0
28 16.0
29 5.0
30 15.0
31 NaN
32 0.0
33 18.0
34 NaN
35 24.0
36 NaN
37 29.0
38 19.0
39 19.0
40 NaN
41 NaN
42 32.0
43 NaN
44 NaN
45 32.0
46 NaN
47 10.0
48 NaN
49 NaN
50 NaN
51 28.0
52 34.0
53 0.0
54 0.0
55 36.0
56 NaN
57 38.0
58 40.0
59 NaN
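Putting the two steps together, chaining .dropna() onto the masked frame leaves only the rows that satisfy their segment's condition; a minimal sketch reusing the answer's data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 50, 60))

# Per-segment boolean frames concatenated into one mask;
# rows absent from every slice align as NaN and are treated as False
mask = pd.concat([df.iloc[:10] > 10, df[11:40] < 30, df[41:] % 2 == 0])
result = df[mask].dropna()
```

Every row that survives satisfies the condition of the segment it came from.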