
How to filter a time series pandas dataframe based on two columns?

I have a time series DataFrame with patient IDs in one column and Years (1-3) in another. Using pandas, I want to keep only the rows for patient IDs that have entries for all of Years 1-3.

IN

ID  Year 
111  1 
111  2 
111  3 
222  1 
222  2
333  1
333  2
333  3

OUT

ID  Year 
111  1 
111  2 
111  3 
333  1
333  2
333  3

It can be done in one line with groupby and filter:

import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333, 333],
    "Year": [1, 2, 3, 1, 2, 1, 2, 3]})

# keep only the IDs whose set of Years is exactly {1, 2, 3}
df = df.groupby('ID').filter(lambda g: sorted(set(g["Year"])) == [1, 2, 3])
print(df)
#    ID  Year
#0  111     1
#1  111     2
#2  111     3
#5  333     1
#6  333     2
#7  333     3
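
If the set of required years may change, or an ID is allowed to have extra years beyond the required ones, the same idea can be written as a subset check. This is only a sketch: the helper name keep_complete_ids and its required_years parameter are made up for illustration, and unlike the one-liner above it keeps IDs that have additional years (e.g. a Year 4) as long as 1, 2 and 3 are all present.

```python
import pandas as pd

def keep_complete_ids(df, required_years=(1, 2, 3)):
    """Keep rows of IDs whose Years include every value in required_years.

    Hypothetical helper; assumes columns named "ID" and "Year".
    """
    required = set(required_years)
    # One set of observed years per ID.
    years_per_id = df.groupby("ID")["Year"].apply(set)
    # IDs whose observed years cover all required years.
    complete_ids = years_per_id[years_per_id.apply(required.issubset)].index
    return df[df["ID"].isin(complete_ids)]

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333, 333],
    "Year": [1, 2, 3, 1, 2, 1, 2, 3]})

result = keep_complete_ids(df)
print(result)
```

Because the result is selected with a boolean mask, the original row index is preserved, just as with filter.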

Assuming that Year can only take the values 1 through 3, you can run:

result = df.groupby('ID').filter(lambda grp: grp.Year.unique().size == 3)

unique() is needed to handle cases where there are, e.g., 3 rows for a particular ID, but all from the same Year.
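
A small sketch of that edge case, using made-up data in which ID 444 has three rows that all belong to Year 1. It compares a plain row count with a distinct-year count; nunique() is used here as shorthand for unique().size.

```python
import pandas as pd

# Hypothetical data: ID 444 has three rows, all from Year 1.
df = pd.DataFrame({
    "ID":   [111, 111, 111, 444, 444, 444],
    "Year": [1, 2, 3, 1, 1, 1],
})

# Counting rows alone wrongly keeps ID 444.
by_row_count = df.groupby("ID").filter(lambda grp: len(grp) == 3)

# Counting distinct years drops it.
by_distinct_years = df.groupby("ID").filter(lambda grp: grp["Year"].nunique() == 3)

print(by_row_count["ID"].unique().tolist())       # [111, 444]
print(by_distinct_years["ID"].unique().tolist())  # [111]
```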
