
Compare two data-frames and remove rows based on multiple conditions

I have two data-frames in the following format. df-a:

         ID         Start_Date     End_Date 
1        cd2        2020-06-01     2020-06-09              
2        cd2        2020-06-24     2020-07-21             
3        cd56       2020-06-10     2020-07-03              
4        cd915      2020-04-28     2020-07-21              
5        cd103      2020-04-13     2020-04-24

and df-b:

         ID         Date
1        cd2        2020-05-12                   
2        cd2        2020-04-12                  
3        cd2        2020-06-29                  
4        cd15       2020-04-28                   
5        cd193      2020-04-13     

        

I need to discard every row in df-b whose Date falls within any of the date ranges listed for the same ID in df-a. I.e. the answer should be:

         ID         Date
1        cd2        2020-05-12                   
2        cd2        2020-04-12                  
                
4        cd15       2020-04-28                   
5        cd193      2020-04-13   

because cd2 is the only ID in df-b that also appears in df-a, and only one of its dates falls within one of cd2's date ranges in df-a.

Sorry for the long-winded question. First time posting.


I tried my best to understand your question, but I am confused by your sample answer.
None of the IDs in df-b should be removed. Even for row 3 of df-b, the date (2020-06-10) does not fall within any start/end date range for ID cd2 in df-a.

I set up an example similar to the one you provided, with df-a being:
         ID         Start_Date     End_Date
0        cd2        2020-06-01     2020-06-11
1        cd2        2020-06-24     2020-07-21
2        cd56       2020-06-10     2020-07-03
3        cd915      2020-04-28     2020-07-21
4        cd103      2020-04-13     2020-04-24

and df-b being:

         ID         Date
0        cd2        2020-05-12
1        cd2        2020-04-12
2        cd2        2020-06-10
3        cd15       2020-04-28
4        cd193      2020-04-13

With this example, row 2 (0-based) of df-b should be removed, since 2020-06-10 falls between 2020-06-01 and 2020-06-11 in row 0 of df-a.
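For anyone who wants to reproduce this, here is a minimal sketch of how the two example frames could be built. It assumes the variable names `df_a` and `df_b` used in the code further down, and it assumes the dates are parsed with `pd.to_datetime` (plain ISO-formatted strings would also compare correctly, but real datetimes are safer).

```python
import pandas as pd

# Example frames from this answer; parsing dates is an assumption, not a requirement
df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": pd.to_datetime(
        ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"]),
    "End_Date": pd.to_datetime(
        ["2020-06-11", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"]),
})

df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(
        ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"]),
})
```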

Here's my code for doing the row deletions:

    df_c = df_b.copy()
    # Assumes df_b has the default 0..n-1 index, so i is also the row label
    for i in range(df_c.shape[0]):
        currentID = df_c.ID[i]
        currentDate = df_c.Date[i]
        # All date ranges recorded in df_a for this ID
        df_a_entriesForCurrentID = df_a.loc[df_a.ID == currentID]
        for j in range(df_a_entriesForCurrentID.shape[0]):
            startDate = df_a_entriesForCurrentID.iloc[j, :].Start_Date
            endDate = df_a_entriesForCurrentID.iloc[j, :].End_Date
            if startDate <= currentDate <= endDate:
                df_c = df_c.drop(i)
                print('dropped')
                break  # row is gone; a second matching range would raise a KeyError

where df_c is the output DataFrame.

After running this, df_c should look like:

         ID         Date
0        cd2        2020-05-12
1        cd2        2020-04-12
3        cd15       2020-04-28
4        cd193      2020-04-13
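The same filtering can probably be done without explicit Python loops. The sketch below is an alternative, not the code from this answer: it merges df-b with df-a on ID and checks each date against each range with `Series.between`. It assumes the example data above, parsed dates, inclusive range bounds, and that df-b's index has no name (so `reset_index()` produces a column called `index`).

```python
import pandas as pd

df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": pd.to_datetime(
        ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"]),
    "End_Date": pd.to_datetime(
        ["2020-06-11", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"]),
})
df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(
        ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"]),
})

# Pair every df_b row with every df_a range sharing its ID.
# reset_index() keeps df_b's original row labels in a column named "index".
pairs = df_b.reset_index().merge(df_a, on="ID", how="inner")

# True where the date lies inside the paired range (both bounds inclusive)
in_range = pairs["Date"].between(pairs["Start_Date"], pairs["End_Date"])

# Labels of df_b rows that fall inside at least one range for their ID
to_drop = pairs.loc[in_range, "index"].unique()

df_c = df_b.drop(index=to_drop)
print(df_c)
```

The merge creates one row per (date, range) pair, so for very large frames with many ranges per ID the intermediate table can get big; for data of this size it should not matter.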
