Combining rows which meet a criterion in R
I have a standard data frame where I have individuals doing a certain behavior over a period of time. When an incident occurs within 50 seconds of the previous incident (Delay <= 50), I would like to combine it with the previous incident. That is, each resulting incident would have a Delay of either NA (first incident) or > 50. The Start time would then be the start time of the first incident in the group (the one with Delay NA or > 50), and the End time would be that of the last incident with Delay <= 50 (see the example data below). I would also like the sum of X1 within the combined incidents. Hopefully the data below clarifies exactly what I am looking for.
Original Data:
ID Incident Start End X1 Delay
Person A 1 747 748 735 NA
Person A 2 868 882 384 120
Person A 3 998 999 354 116
Person A 4 1057 1059 382 58
Person A 5 1063 1064 138 4
Person A 6 1077 1078 138 13
Person A 7 1412 1413 384 334
Person B 1 739 740 387 NA
Person B 2 742 743 132 2
Person B 3 760 761 386 17
Person B 4 768 769 731 7
Person B 5 835 835 894 66
Person B 6 838 839 891 3
Person B 7 925 926 385 86
Desired Data:
ID Iteration Start End X1 Delay
Person A 1 747 748 735 NA
Person A 2 868 882 384 120
Person A 3 998 999 354 116
Person A 4 1057 1078 658 58
Person A 5 1412 1413 384 334
Person B 1 739 769 1636 NA
Person B 2 835 839 1785 66
Person B 3 925 926 385 86
I have tried multiple things; the issue is that I can't just aggregate by ID, because the same person might have two separate incidents.
Thanks! Let me know if you need any more information.
I think you have a mistake in your desired result table. Line 5 should be Person A.
Here's a way to do that with dplyr. The rationale is that we first combine incidents using cumsum: if a delay is > 50 or NA, the incident number is increased by one. Then, we summarise on this new incident column.
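library(dplyr)

df %>%
  group_by(ID) %>%
  # start a new incident whenever Delay is NA or greater than 50
  mutate(Incident = cumsum(Delay > 50 | is.na(Delay))) %>%
  group_by(ID, Incident) %>%
  # collapse each combined incident into a single row
  summarise(Start = first(Start), End = last(End), X1 = sum(X1), Delay = first(Delay))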
ID Incident Start End X1 Delay
<chr> <int> <int> <int> <int> <int>
1 PersonA 1 747 748 735 NA
2 PersonA 2 868 882 384 120
3 PersonA 3 998 999 354 116
4 PersonA 4 1057 1078 658 58
5 PersonA 5 1412 1413 384 334
6 PersonB 1 739 769 1636 NA
7 PersonB 2 835 839 1785 66
8 PersonB 3 925 926 385 86
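To see why cumsum produces the grouping key, it may help to look at the flag for one person in isolation. The sketch below uses only base R and the Delay values for Person A from the question; the variable names delay, starts_new and incident are made up for illustration and are not part of the answer's code.

```r
# Delay values for Person A, copied from the question's data
delay <- c(NA, 120, 116, 58, 4, 13, 334)

# TRUE where a new combined incident starts: the first row (NA) or a gap > 50.
# NA > 50 is NA, but NA | TRUE evaluates to TRUE, so the first row is flagged.
starts_new <- delay > 50 | is.na(delay)

# The running count of the flags gives one id per combined incident
incident <- cumsum(starts_new)
print(incident)
# [1] 1 2 3 4 4 4 5
```

Rows 4 to 6 share the id 4, which is why they collapse into the single row with Start 1057, End 1078 and X1 658 in the result above.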
Data
df <- read.table(text="ID Incident Start End X1 Delay
PersonA 1 747 748 735 NA
PersonA 2 868 882 384 120
PersonA 3 998 999 354 116
PersonA 4 1057 1059 382 58
PersonA 5 1063 1064 138 4
PersonA 6 1077 1078 138 13
PersonA 7 1412 1413 384 334
PersonB 1 739 740 387 NA
PersonB 2 742 743 132 2
PersonB 3 760 761 386 17
PersonB 4 768 769 731 7
PersonB 5 835 835 894 66
PersonB 6 838 839 891 3
PersonB 7 925 926 385 86",header=TRUE,stringsAsFactors=FALSE)