R: Group by one column, and return the first row that has a value greater than 0 in any of the other columns and then return all rows after this row
I'm new to R programming and hope someone can help me with the situation below:
I have a dataframe shown in the picture (Original Dataframe). Grouped by the [ID] column, I would like to return the first record that has a value >= 1 in any of the four columns (A, B, C, or D), together with all the records after it based on the [Date] column (the desired dataframe should look like the Output Dataframe shown in the picture). Basically, remove all the records highlighted in yellow. I would greatly appreciate it if you could provide the R code to achieve this.
structure(list(ID = c(101L, 101L, 101L, 101L, 101L, 101L, 103L,
103L, 103L, 103L), Date = c(43338L, 43306L, 43232L, 43268L, 43183L,
43144L, 43310L, 43246L, 43264L, 43209L), A = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L), B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L), C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), D = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("ID", "Date",
"A", "B", "C", "D"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
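The Date values in this dput() output are plain integers. They look like Excel-style serial day numbers; that is an assumption, since the question does not say so. Under that assumption, a minimal sketch of the conversion (the origin 1899-12-30 is the usual choice for Excel serials) would be:

```r
# Assumption: Date holds Excel-style serial day numbers (days since 1899-12-30).
serial <- c(43338L, 43306L, 43232L, 43268L, 43183L, 43144L,
            43310L, 43246L, 43264L, 43209L)

# as.Date() adds the number of days to the given origin
dates <- as.Date(serial, origin = "1899-12-30")

# Show the first value in dd.mm.yyyy form
format(dates[1], "%d.%m.%Y")
```

Sorting by the raw integers or by the converted dates gives the same order, so the conversion only matters for readability.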
Here is a solution:
ID Date A B C D
1 101 26.08.2018 0 0 0 0
2 101 25.07.2018 0 2 0 0
3 101 12.05.2018 0 0 1 0
4 101 17.06.2018 0 0 0 0
5 101 24.03.2018 0 0 0 0
6 101 13.02.2018 0 0 0 0
7 103 29.07.2018 0 0 0 0
8 103 26.05.2018 1 1 0 0
9 103 13.06.2018 0 0 0 0
10 103 19.04.2018 0 0 0 0
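# Row-wise sum of columns A to D; a positive sum means at least one value > 0
data$Check <- rowSums(data[, 3:6])
# Parse the dd.mm.yyyy strings so the rows can be ordered chronologically
data$Date <- as.Date(data$Date, "%d.%m.%Y")
data <- data[order(data$ID, data$Date), ]
id <- unique(data$ID)
for (i in seq_along(id)) {
  data_sample <- data[data$ID == id[i], ]
  # Keep everything from the first row with Check > 0 to the end of the group
  # (assumes every ID has at least one such row)
  data_sample <- data_sample[min(which(data_sample$Check > 0)):nrow(data_sample), ]
  if (i == 1) {
    final <- data_sample
  } else {
    final <- rbind(final, data_sample)
  }
}
# Drop the helper column Check (column 7)
final <- final[, -7]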
ID Date A B C D
3 101 2018-05-12 0 0 1 0
4 101 2018-06-17 0 0 0 0
2 101 2018-07-25 0 2 0 0
1 101 2018-08-26 0 0 0 0
8 103 2018-05-26 1 1 0 0
9 103 2018-06-13 0 0 0 0
7 103 2018-07-29 0 0 0 0
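The same "first hit and everything after it" idea can also be written without the explicit loop. The following is only a sketch of one alternative, not the answerer's method: it uses ave() with cumsum() to build a per-ID running count of hits, and the names hit and keep are introduced here for illustration.

```r
# Same data as in the table above, with Date as dd.mm.yyyy strings
data <- data.frame(
  ID   = c(101L, 101L, 101L, 101L, 101L, 101L, 103L, 103L, 103L, 103L),
  Date = c("26.08.2018", "25.07.2018", "12.05.2018", "17.06.2018", "24.03.2018",
           "13.02.2018", "29.07.2018", "26.05.2018", "13.06.2018", "19.04.2018"),
  A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
)
data$Date <- as.Date(data$Date, "%d.%m.%Y")
data <- data[order(data$ID, data$Date), ]

# TRUE for rows where at least one of A..D is > 0
hit <- rowSums(data[, c("A", "B", "C", "D")] > 0) > 0

# Running count of hits within each ID; it is > 0 from the first hit onwards
keep <- ave(as.integer(hit), data$ID, FUN = cumsum) > 0

final <- data[keep, ]
final
```

An ID with no value above 0 simply ends up with no rows kept, so this version needs no special case for such groups.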
Here's a tidyverse solution. The filter condition deserves some explanation:
First, we arrange by ID and Date and group_by ID.
Next, we test each row for whether any of the variables is > 0.
We then take the first row where that is true and get the Date value for that row.
Finally, we filter for rows where Date is >= this value.
Since we're still grouping by ID, all these calculations happen separately for each group:
library(dplyr)

df %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  filter(Date >= Date[min(which(A > 0 | B > 0 | C > 0 | D > 0))])
# A tibble: 7 x 6
# Groups: ID [2]
ID Date A B C D
<int> <int> <int> <int> <int> <int>
1 101 43232 0 0 1 0
2 101 43268 0 0 0 0
3 101 43306 0 2 0 0
4 101 43338 0 0 0 0
5 103 43246 1 1 0 0
6 103 43264 0 0 0 0
7 103 43310 0 0 0 0
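One case the filter above does not cover cleanly is an ID where no column is ever above 0: min(which(...)) then works on an empty vector and returns Inf with a warning. A hedged variant is sketched below using dplyr's cumany(), which stays FALSE for such a group. ID 105 is a hypothetical all-zero group added here only to show this; it is not part of the original data.

```r
library(dplyr)

# Original data plus a hypothetical ID 105 whose A..D values are all 0
df <- data.frame(
  ID   = c(101L, 101L, 101L, 101L, 101L, 101L, 103L, 103L, 103L, 103L, 105L, 105L),
  Date = c(43338L, 43306L, 43232L, 43268L, 43183L, 43144L,
           43310L, 43246L, 43264L, 43209L, 43300L, 43200L),
  A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
)

result <- df %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  # cumany() turns TRUE at the first row with a value > 0 and stays TRUE after it
  filter(cumany(A > 0 | B > 0 | C > 0 | D > 0)) %>%
  ungroup()

result
```

For IDs 101 and 103 this keeps the same seven rows as the filter above; the all-zero group is dropped without a warning.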