Python Pandas: selecting rows in a dataframe based on the relative values of other fields
I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})
ID (not unique) | Flag (Y/N) | Snapshot Month (unique for each ID)
---|---|---
0001 | Y | 05
0001 | N | 06
0002 | N | 01
0002 | Y | 02
Data from all months are aggregated into one dataframe, so the IDs are not unique, and months range from 01 to 12 (all of 01-12 are included; I left out most of the months for brevity). The flag variable can only go from Y to N, not the other way around. Furthermore, we can assume the flag variable changes at most once.
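The rule above can be written as a check on a single ID's flags taken in month order: the sequence must be a run of Ys followed by a run of Ns. The following is a minimal sketch, assuming the flags are only ever 'Y' or 'N'; `is_valid_sequence` is a hypothetical helper name used for illustration, not a pandas function.

```python
def is_valid_sequence(flags):
    """Return True if flags (in chronological order) never go from N to Y."""
    # With only 'Y' and 'N' present, a legal history looks like YYY...NNN,
    # which is the same as saying the joined string never contains "NY".
    return "NY" not in "".join(flags)


print(is_valid_sequence(["Y", "N"]))  # flags of ID 0001 in the sample
print(is_valid_sequence(["N", "Y"]))  # flags of ID 0002 in the sample
```

Applying such a check row by row would be slow on a large dataframe, but it pins down exactly what counts as an error.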
There are errors in the data. For example, ID 0002 ('002' in the sample code above) is illegal, as it goes from N to Y chronologically.
I want to be able to find the IDs corresponding to those data errors.
What I have tried is to build one dataframe of the Y rows and one of the N rows, find the IDs they have in common, and then go into the rows themselves to see whether an error has occurred. But this method is not only inefficient but also impossible to scale as the data becomes large.
Since the snapshot month ranges from 01 to 12 (all data come from the same year), I computed a dataframe consisting of the Y rows with a snapshot month of 12, and checked whether those IDs have any N rows in months other than 12. However, this is also too manual and does not find all of the errors. I wonder if there are some clever ways to use the snapshot month.
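One way the snapshot month could be used directly, sketched here under the assumption that the flags are only 'Y' or 'N' and that all rows come from the same year: an ID contains an N-to-Y transition exactly when its latest Y month is later than its earliest N month. The names `last_y`, `first_n`, and `bad_ids` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002'],
                   'Flag': ['Y', 'N', 'N', 'Y'],
                   'Snapshot Month': ['05', '06', '01', '02']})

# Convert the zero-padded month strings to integers so they compare numerically.
tmp = df.assign(month=df['Snapshot Month'].astype(int))

# Latest month flagged Y and earliest month flagged N, per ID.
last_y = tmp[tmp['Flag'] == 'Y'].groupby('ID')['month'].max()
first_n = tmp[tmp['Flag'] == 'N'].groupby('ID')['month'].min()

# Align on ID; IDs that lack either a Y or an N row get NaN here and
# compare as False, so they are never reported.
last_y, first_n = last_y.align(first_n)
is_bad = last_y > first_n
bad_ids = is_bad[is_bad].index.tolist()
print(bad_ids)  # ['002']
```

This generalises the month-12 check described above: instead of looking only at Y rows in December, it compares the extreme months of the two flags for every ID at once.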
Here's one approach:
(i) set_index with 'ID'
(ii) replace the N values with np.nan
(iii) groupby "ID" (which is now the index) and forward fill the np.nan values
(iv) groupby "ID" again and check whether any group still has NaN values (that means the group has leading N values); if so, build a boolean mask marking those "ID"s
(v) use the mask from (iv) on df
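import numpy as np

# Assumes the rows are already ordered by 'Snapshot Month' within each ID.
df = df.set_index('ID')
mask = (df['Flag']
        .replace('N', np.nan)          # (ii) only the Y flags remain
        .groupby(level=0).ffill()      # (iii) carry each Y forward within its ID
        .groupby(level=0).transform(lambda x: x.isna().sum() > 0))  # (iv) NaN left over means a leading N
out = df.index[mask].unique().tolist()  # (v) IDs with at least one flagged row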
Output:
['002']
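A variant of the same idea, offered as a sketch rather than as part of the answer above: sort by month first, then compare each row's flag with the previous row's flag for the same ID. This makes the ordering assumption explicit, and it does not report IDs whose flag is N in every month, which the forward-fill approach also reports because such groups are left entirely NaN. Whether all-N IDs count as errors is not stated in the question, so treat that behaviour as an assumption. The ID '003' below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001', '002', '002', '003'],
                   'Flag': ['Y', 'N', 'N', 'Y', 'N'],   # '003' is a hypothetical all-N ID
                   'Snapshot Month': ['05', '06', '01', '02', '03']})

# Order rows chronologically within each ID
# (zero-padded month strings sort correctly as text).
ordered = df.sort_values(['ID', 'Snapshot Month'])

# Flag of the previous snapshot for the same ID (NaN on each ID's first row).
prev_flag = ordered.groupby('ID')['Flag'].shift()

# An illegal transition is a Y row whose previous row for that ID was N.
illegal = (prev_flag == 'N') & (ordered['Flag'] == 'Y')
bad_ids = ordered.loc[illegal, 'ID'].unique().tolist()
print(bad_ids)  # ['002']
```

Filtering `ordered` with `illegal` also gives the exact rows where the N-to-Y flip happens, which may help when inspecting the errors.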