简体   繁体   中英

How to filter complex pandas dataframe with conditions?

I have a Dataframe looks like the following..

import pandas as pd
import numpy as np
# Create data set.
present = 12
died  = 20
dataSet = {'id': ['A', 'A', 'A','A','B','B','B','C'],
           'id_2': [1, 2, 3, 1, 1,2,3,1],
           'start' : [9,13,12,11,9,20,22,13],
           'end' : [14,22,21,19,10,30,24,18]}

# Create dataframe with data set and named columns.
df = pd.DataFrame(dataSet, columns= ['id', 'id_2', 'start','end'])

    id  id_2    start   end
0   A   1          9    14
1   A   2         13    22
2   A   3         12    21
3   A   1         11    19
4   B   1          9    10
5   B   2         20    30
6   B   3         22    24
7   C   1         13    18

we have present = 12, and died = 20 and i want to filter the dataframe in a following diagram.

在此处输入图像描述

where pink box represents df_begin, yellow for df_between purple for df_end.

I want to combine this, but I had to do it separately as following. (present and died inclusive)

df_start = df.loc[(df['start'] <= present) & (df['end'] >=present)]
df_between = df.loc[(df['start'] >= present) & (df['end'] <= died)]
df_end = df.loc[(df['start'] <= died) & (df['end'] >= died)]

concat these three dataframes and drop duplicate will give me 3 colored box combined which is I want, but is there way to do in a simple/better/fancy way? (ie. imagine this dataframe is more than 1million - performance also matters..)

Hence, desired output would be..

    id  id_2    start   end
0   A   1          9    14
1   A   2         13    22
2   A   3         12    21
3   A   1         11    19
4   B   2         20    30
5   C   1         13    18

Thank you!

IIUC, you can do it with two conditions and a negation with OR.

m1 = (df['end'] < present) #Find all ranges that end before present
m2 = (df['start'] > died)  #Find all ranges that start after died

df[~(m1|m2)] #Negate to find all ranges that insert and overlap present to died

Output:

  id  id_2  start  end
0  A     1      9   14
1  A     2     13   22
2  A     3     12   21
3  A     1     11   19
5  B     2     20   30
7  C     1     13   18

If I understand this correctly, you are looking to have three separates dataframes according to the logic specified, and at the same time be able to concat them into a single dataframe with no duplicates rapidly.

You can save your conditions as masks:

df_start_mask = (df['start'] <= present) & (df['end'] >=present)
df_between_mask = (df['start'] >= present) & (df['end'] <= died)
df_end_mask = (df['start'] <= died) & (df['end'] >= died)

Creating the three separate dataframes is similar to before:

df_start = df.loc[df_start_mask]
df_between = df.loc[df_between_mask]
df_end = df.loc[df_end_mask]

However creating the combined dataframe is much faster, because instead of having to concat, you can directly index from your original dataframe:

combined_df = df.loc[df_start_mask | df_between_mask | df_end_mask ]

Which returns the intended result:

>>> print(combined_df)
  id  id_2  start  end
0  A     1      9   14
1  A     2     13   22
2  A     3     12   21
3  A     1     11   19
5  B     2     20   30
7  C     1     13   18

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM