
Pandas: Conditional replace on consecutive rows within a group

I am trying to build "episodes" from a list of transactions organized by group (patient). I used to do this in Stata, but I'm not sure how to do it in Python. In Stata, I would write something like:

by patient: replace startDate = startDate[_n-1] if startDate-endDate[_n-1]<10

In English, that means: start with the first row of a group and check whether the number of days between that row's startDate and the prior row's endDate is less than 10. If it is, replace that row's startDate with the prior row's startDate. Then move to the next row and do the same thing, then the next row, until all rows are exhausted.

I have been trying to figure out how to do the same thing in Python/Pandas and keep running into a wall. I could sort the DataFrame by patient and date and then iterate over the entire DataFrame, but it seems like there should be a better way to do this.

It's important that the script compare row 2 to row 1 first: if the script has replaced the value in row 2, then when I get to row 3 I want to use the replaced value, not the original value.
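The row-by-row behaviour described above can be sketched as a plain loop, which is useful as a reference to check any vectorized version against. This is only an illustration: the helper name `replace_sequentially` and the `max_gap_days` parameter are made up here, and the rows are assumed to be sortable by patient and startDate. On the sample rows from this question, a strict "less than 10 days" test leaves the third row at 1/28/2016, because it starts 16 days after the prior endDate of 1/12/2016.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "Patient": [1, 1, 1, 1, 2],
    "startDate": pd.to_datetime(
        ["1/1/2016", "1/11/2016", "1/28/2016", "6/15/2016", "3/1/2016"],
        format="%m/%d/%Y"),
    "endDate": pd.to_datetime(
        ["1/2/2016", "1/12/2016", "1/28/2016", "6/16/2016", "3/1/2016"],
        format="%m/%d/%Y"),
})

def replace_sequentially(frame, max_gap_days=10):
    """Hypothetical helper: row-by-row version of the Stata rule."""
    out = frame.sort_values(["Patient", "startDate"]).reset_index(drop=True)
    starts = out["startDate"].tolist()
    for i in range(1, len(out)):
        same_patient = out.at[i, "Patient"] == out.at[i - 1, "Patient"]
        gap_days = (starts[i] - out.at[i - 1, "endDate"]).days
        if same_patient and gap_days < max_gap_days:
            # Copy the previous row's start, which may itself have been replaced
            starts[i] = starts[i - 1]
    out["startDate"] = starts
    return out

result = replace_sequentially(df)
print(result)
```

The loop is easy to reason about but slow on large frames, which is why a groupby-based approach is usually preferred.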

Sample input:

Patient    startDate    endDate  
1          1/1/2016     1/2/2016  
1          1/11/2016    1/12/2016  
1          1/28/2016    1/28/2016  
1          6/15/2016    6/16/2016  
2          3/1/2016     3/1/2016

Sample output:

Patient    startDate    endDate  
1          1/1/2016     1/2/2016  
1          1/1/2016     1/12/2016  
1          1/1/2016     1/28/2016  
1          6/15/2016    6/16/2016  
2          3/1/2016     3/1/2016

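I think we need groupby + shift, and mask + ffill is the key: blank out every startDate that falls less than 10 days after the previous row's endDate for the same patient, then forward-fill the last startDate that was kept. The condition only uses the original startDate and endDate values, so the forward fill gives the same result as replacing row by row.

import pandas as pd

df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

# Assumes df is already sorted by Patient and startDate.
# Gap between each row's start and the previous row's end for the same patient
# (NaT on each patient's first row, so that row is never masked).
gap = df['startDate'] - df.groupby('Patient')['endDate'].shift()
# Blank out starts with a gap under 10 days, then carry the last kept start forward.
df['startDate'] = df['startDate'].mask(gap < pd.Timedelta(days=10)).groupby(df['Patient']).ffill()
df
Out[495]: 
   Patient  startDate    endDate
0        1 2016-01-01 2016-01-02
1        1 2016-01-01 2016-01-12
2        1 2016-01-28 2016-01-28
3        1 2016-06-15 2016-06-16
4        2 2016-03-01 2016-03-01

Row 2 keeps 2016-01-28 because it starts 16 days after the previous endDate (2016-01-12), which is not less than 10.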
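Since the stated goal is to build episodes, a related sketch is to number the episodes directly and then collapse each one to a single row. A row opens a new episode when it is the patient's first row or starts 10 or more days after the previous endDate; a per-patient cumulative sum of those flags gives an episode number. The `episode` column name is an assumption made for this example, as is taking the latest endDate as the end of an episode (the original Stata rule only rewrites startDate).

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "Patient": [1, 1, 1, 1, 2],
    "startDate": pd.to_datetime(
        ["1/1/2016", "1/11/2016", "1/28/2016", "6/15/2016", "3/1/2016"],
        format="%m/%d/%Y"),
    "endDate": pd.to_datetime(
        ["1/2/2016", "1/12/2016", "1/28/2016", "6/16/2016", "3/1/2016"],
        format="%m/%d/%Y"),
})

df = df.sort_values(["Patient", "startDate"]).reset_index(drop=True)

# Previous row's endDate within the same patient (NaT on the first row)
prev_end = df.groupby("Patient")["endDate"].shift()

# True where a row continues the previous episode (gap under 10 days);
# comparisons against NaT are False, so first rows always open an episode
continues = (df["startDate"] - prev_end) < pd.Timedelta(days=10)
new_episode = ~continues

# Running count of episode openings per patient = episode number
df["episode"] = new_episode.astype(int).groupby(df["Patient"]).cumsum()

# One row per episode: earliest start, latest end (the latter is an assumption)
episodes = (
    df.groupby(["Patient", "episode"], as_index=False)
      .agg(startDate=("startDate", "min"), endDate=("endDate", "max"))
)
print(episodes)
```

If only the replaced startDate column is needed, `df.groupby(["Patient", "episode"])["startDate"].transform("min")` would give it without collapsing the rows.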

