I have a time-series DataFrame with patient IDs in one column and Years (1-3) in another column. Using pandas, I want to keep the rows for a patient ID only if that ID has entries for all of Years 1-3.
IN
ID Year
111 1
111 2
111 3
222 1
222 2
333 1
333 2
333 3
OUT
ID Year
111 1
111 2
111 3
333 1
333 2
333 3
It can be done in one line like so:
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333, 333],
    "Year": [1, 2, 3, 1, 2, 1, 2, 3]})
# keep only the IDs whose distinct years are exactly [1, 2, 3]
df = df.groupby('ID').filter(lambda g: sorted(set(g["Year"].values)) == [1, 2, 3])
print(df)
# ID Year
#0 111 1
#1 111 2
#2 111 3
#5 333 1
#6 333 2
#7 333 3
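If the set of required years might change, the same idea can be wrapped in a small helper. This is only a sketch: keep_complete_ids and its required_years parameter are hypothetical names, not part of pandas. Note that it uses a subset test, so unlike the equality check above it also keeps IDs that have extra years beyond the required ones.

```python
import pandas as pd

def keep_complete_ids(frame, required_years=(1, 2, 3)):
    """Keep rows of every ID whose Year values cover all required years.

    Hypothetical helper for illustration; assumes columns "ID" and "Year".
    """
    required = set(required_years)
    # filter() keeps a whole group when the lambda returns True
    return frame.groupby("ID").filter(
        lambda g: required.issubset(g["Year"].tolist())
    )

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333, 333],
    "Year": [1, 2, 3, 1, 2, 1, 2, 3],
})
complete = keep_complete_ids(df)
print(complete)
```

If IDs with extra years should be dropped as well, replacing the subset test with set(g["Year"]) == required would match the stricter behaviour of the one-liner.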
Assuming that Year can only take the values 1 through 3, you can run:
result = df.groupby('ID').filter(lambda grp: grp.Year.unique().size == 3)
unique() is needed to handle cases where there are, e.g., 3 rows for a particular ID, but all from the same Year.
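As a possible alternative, the distinct-year count can be computed without a lambda by using transform("nunique") and a boolean mask. This is a sketch under the same assumption that Year only takes the values 1 through 3. The extra ID 444 is made-up data, added only to show the duplicate-year case.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333, 333, 444, 444, 444],
    "Year": [1, 2, 3, 1, 2, 1, 2, 3, 1, 1, 1],  # 444: three rows, one year
})

# Number of distinct years per ID, broadcast back onto every row
years_per_id = df.groupby("ID")["Year"].transform("nunique")

# Keep only the rows whose ID covers three distinct years
result = df[years_per_id == 3]
print(result)
```

ID 444 has three rows but only one distinct year, so it is dropped along with ID 222.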