
Python Pandas selecting rows in a dataframe based on the relative values of other fields

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})

ID (not unique)   Flag (Y/N)   Snapshot Month (unique for each ID)
0001              Y            05
0001              N            06
0002              N            01
0002              Y            02

Data from all months are aggregated into one dataframe, so the IDs are not unique. Months range from 01 to 12 (all twelve are included; I left out most of them for brevity). The flag variable can only go from Y to N, not the other way around. Furthermore, we can assume the flag variable changes at most once.

There are errors in the data. For example, ID 0002 is illegal, as it goes from N to Y chronologically.
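
To make the rule concrete, here is a minimal sketch of the check for a single ID whose flags are already listed in chronological order. The helper name `has_illegal_transition` is made up for illustration; it is not part of pandas.

```python
def has_illegal_transition(flags):
    """Return True if a 'Y' appears after an 'N' in a chronologically ordered sequence."""
    seen_n = False
    for flag in flags:
        if flag == 'N':
            seen_n = True
        elif flag == 'Y' and seen_n:
            # A Y after an N means the flag went from N back to Y.
            return True
    return False

print(has_illegal_transition(['Y', 'N']))  # ID 0001: Y then N is allowed -> False
print(has_illegal_transition(['N', 'Y']))  # ID 0002: N then Y is an error -> True
```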

I want to find the IDs that contain such errors.

What I have tried is to build one dataframe of the Y rows and one of the N rows, find the IDs they have in common, and then go into the rows themselves to see whether an error has occurred. This method is not only inefficient but also impossible to scale as the data becomes large.

Since the snapshot month ranges from 01 to 12 (all data come from the same year), I also computed a dataframe of the Y rows with snapshot month 12 and checked whether those IDs have any N's in months other than 12. However, this is also too manual and does not find every error. I wonder if there is a clever way to use the snapshot month.

Here's one approach:

(i) set_index with 'ID'

(ii) replace N values with np.nan

(iii) groupby "ID" (which is index now), and forward fill np.nan values

(iv) groupby "ID" again and check whether any group still has NaN values (which means the group has leading N values); if so, create a boolean mask marking those "ID"s

(v) Use the mask from (iv) on df

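import numpy as np

df = df.set_index('ID')                        # (i)
mask = (df['Flag']
        .replace('N', np.nan)                  # (ii) N -> NaN
        .groupby(level=0).ffill()              # (iii) forward fill within each ID
        # (iv) True for every row of an ID that still has NaN (leading N's)
        .groupby(level=0).transform(lambda x: x.isna().sum()>0))
out = df.index[mask].unique().tolist()         # (v) IDs selected by the mask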

Output:

['002']
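
Note that the approach above relies on the rows for each ID already being in chronological order, and it also reports an ID whose flags are all N, since nothing precedes those values to forward fill from. If either matters, one possible variant is sketched below. It uses the sample data from the question and compares each ID's earliest N month with its latest Y month. The names `work`, `first_n`, `last_y` and `bad_ids` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})

# Work on a copy with the zero-padded month strings converted to integers.
work = df.assign(month=df['Snapshot Month'].astype(int))

# Earliest month flagged N and latest month flagged Y, per ID.
first_n = work[work['Flag'] == 'N'].groupby('ID')['month'].min()
last_y = work[work['Flag'] == 'Y'].groupby('ID')['month'].max()

# Keep only IDs that have both flags; all-Y or all-N IDs cannot violate the rule.
first_n, last_y = first_n.align(last_y, join='inner')

# An ID is in error if some Y occurs after its first N.
bad_ids = first_n.index[first_n < last_y].tolist()
print(bad_ids)  # ['002']
```

Because the comparison uses the month values rather than row position, the result does not depend on how the rows are sorted.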


 