
Python Pandas selecting rows in a dataframe based on the relative values of other fields

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})
ID (not unique)   Flag (Y/N)   Snapshot Month (unique for each ID)
001               Y            05
001               N            06
002               N            01
002               Y            02

Data from all months are aggregated into one dataframe, so the IDs are not unique, and months range from 01 to 12 (01-12 are all included; I left out most of the months for brevity). The flag variable can only go from Y to N, not the other way around. Furthermore, we can assume the flag variable can only change once.

There are errors in the data. For example, ID 002 is illegal, as it goes from N to Y chronologically.

I want to find the IDs corresponding to those data errors.

What I have tried is to build one dataframe of the Y rows and one of the N rows, find the IDs they have in common, and then go through the rows themselves to see whether an error has occurred. But this method is not only inefficient, it also does not scale as the data grows.

Since the snapshot month ranges from 01 to 12 (all data come from the same year), I computed a dataframe consisting of the Y's with snapshot month 12, and checked whether those IDs have any N's in months other than 12. However, this is also too manual and does not find all the errors. I wonder if there are some clever ways to use the snapshot month.
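One way the snapshot month could be used directly, as a minimal sketch: since the flag may only go from Y to N, every Y month of an ID has to be earlier than every N month of that ID. The code below assumes the column names from the example dataframe and that months are zero-padded strings that can be cast to integers; the names `month`, `last_y`, `first_n`, `both` and `bad_ids` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})

# Assumption: months are zero-padded strings such as '05'; cast to int to compare.
d = df.assign(month=df['Snapshot Month'].astype(int))

# Latest month with a Y and earliest month with an N, per ID.
last_y = d.loc[d['Flag'] == 'Y'].groupby('ID')['month'].max()
first_n = d.loc[d['Flag'] == 'N'].groupby('ID')['month'].min()

# Keep only IDs that have both a Y and an N (inner join on ID).
both = pd.concat([last_y, first_n], axis=1,
                 keys=['last_y', 'first_n'], join='inner')

# An ID is suspicious when an N shows up before its last Y.
bad_ids = both.index[both['first_n'] < both['last_y']].tolist()
print(bad_ids)
```

For the sample data this prints ['002']. IDs that only ever have Y or only ever have N are left out by the inner join, so they are not reported.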

Here's one approach:

(i) set_index with 'ID'

(ii) replace N values with np.nan

(iii) groupby "ID" (which is the index now), and forward fill the np.nan values

(iv) groupby "ID" again and check whether any group still has NaN values (that means the group has leading N values); if so, create a boolean mask marking those IDs

(v) Use the mask from (iv) on df

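import numpy as np

df = df.set_index('ID')                       # (i)
mask = (df['Flag']
        .replace('N', np.nan)                 # (ii) N becomes NaN, Y stays
        .groupby(level=0).ffill()             # (iii) forward fill within each ID
        .groupby(level=0).transform(lambda x: x.isna().sum()>0))  # (iv) any NaN left?
out = df.index[mask].unique().tolist()        # (v) IDs with a leading N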

Output:

['002']
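Note that the approach above relies on the rows of each ID already being in chronological order, and that an ID whose flags are all N would also be reported, since nothing fills its NaN values. If the order is not guaranteed, a variant that sorts first and then looks for a Y whose previous snapshot was an N could look like the sketch below. It assumes the original dataframe (before `set_index`) with the column names from the question; `ordered`, `prev_flag`, `bad_rows` and `bad_ids` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})

# Sort chronologically within each ID
# (zero-padded month strings sort in the right order).
ordered = df.sort_values(['ID', 'Snapshot Month'])

# Flag of the previous snapshot for the same ID (NaN for the first snapshot).
prev_flag = ordered.groupby('ID')['Flag'].shift()

# An illegal transition is a Y whose previous snapshot was an N.
bad_rows = ordered[(prev_flag == 'N') & (ordered['Flag'] == 'Y')]
bad_ids = bad_rows['ID'].unique().tolist()
print(bad_ids)
```

For the sample data this prints ['002'], and `bad_rows` holds the offending row itself (ID 002, month 02), which makes it easier to inspect where the flag went wrong.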


 