Collapse observation rows based on first and last occurrence in R
I have a dataset like this.
ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000
I want to collapse the observations so that, for a given ID, the earliest row with an ENTRY value is combined with the latest row with an EXIT value, and I also get the EQP_ID associated with the exit record:
ID EQP_ID ENTRY EXIT
10 8123 0058 0059
11 2111 0063 1000
I'm fairly new to R, and this was complicated enough that I couldn't think of a good way to do it without resorting to a loop, whose performance is predictably not very good.
Edit
I think this does it, but I'd still be curious whether more experienced folks have a better answer:
library(dplyr)
# arrange() sorts NA values last, so first(ENTRY) is the smallest ENTRY per ID
group_by(dataset, ID) %>%
  arrange(ENTRY) %>%
  summarize(ENTRY = first(ENTRY), EXIT = last(EXIT), EQP_ID = last(EQP_ID))
One option with data.table:
library(data.table)

# create example data
dt <- data.table(
  id = c(10, 10, 11, 11, 11, 11),
  date = seq(as.Date("2018-10-1"), as.Date("2018-10-6"), by = "day"),
  entry = c(58, NA, 63, 64, NA, NA),
  exit = c(NA, 59, NA, NA, 99, 100)
)

# number rows by id
dt[order(id, date), num := 1:.N, by = id]

# get first-entry and last-exit values by id
dt[, keepentry := entry[1], by = id]
dt[, keepexit := exit[.N], by = id]

# keep one row per id
dt[num == 1, .(id, keepentry, keepexit)]
Not my most elegant work, but it will get the job done.
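The example above does not carry the EQP_ID column that the question asks for. A minimal sketch of a single grouped data.table call that also returns it, assuming the question's column names and assuming every ID has at least one non-NA ENTRY and one non-NA EXIT (the `res` name is only for illustration):

```r
library(data.table)

# Example data using the column names and values from the question
dt <- data.table(
  ID     = c(10, 10, 11, 11, 11, 11),
  EQP_ID = c(1232, 8123, 8231, 233, 2512, 2111),
  DATE   = as.Date(c("2018-10-01", "2018-10-01", "2018-10-02",
                     "2018-10-03", "2018-10-04", "2018-10-05")),
  ENTRY  = c(58, NA, 63, 64, NA, NA),
  EXIT   = c(NA, 59, NA, NA, 99, 1000)
)

# Sort by ID and DATE, then per ID take the first non-NA ENTRY,
# the last non-NA EXIT, and the EQP_ID of that last exit row.
# Assumption: each ID has at least one ENTRY and one EXIT.
res <- dt[order(ID, DATE),
          .(EQP_ID = tail(EQP_ID[!is.na(EXIT)], 1),
            ENTRY  = head(ENTRY[!is.na(ENTRY)], 1),
            EXIT   = tail(EXIT[!is.na(EXIT)], 1)),
          by = ID]
print(res)
```

Dropping the NA values before taking the first or last element means the result does not depend on which row of a group happens to hold the NA.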
Using dplyr::first and dplyr::last we can do the following; another option is to use min and max.
library(dplyr)
df %>% group_by(ID) %>%
summarise(EQP_ID=dplyr::last(EQP_ID), First=dplyr::first(ENTRY),Last=dplyr::last(EXIT))
# A tibble: 2 x 4
ID EQP_ID First Last
<int> <int> <int> <int>
1 10 8123 58 59
2 11 2111 63 1000
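The min and max option mentioned in this answer is not shown, so here is a rough sketch of what it might look like. It assumes that ENTRY and EXIT values increase over time within an ID, which holds for the sample data but is not stated in the question; the `res` name is only for illustration.

```r
library(dplyr)

# Same sample data as in the question
df <- read.table(text = "ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000", header = TRUE)

# Assumption: the smallest ENTRY is the earliest one and the
# largest EXIT is the latest one within each ID.
res <- df %>%
  group_by(ID) %>%
  summarise(EQP_ID = dplyr::last(EQP_ID),
            First  = min(ENTRY, na.rm = TRUE),
            Last   = max(EXIT, na.rm = TRUE))
print(res)
```

Note that min and max with na.rm = TRUE return Inf or -Inf (with a warning) for a group whose values are all NA, so this variant only suits data where every ID has both an entry and an exit.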
This solution uses dplyr. First, define the data frame.
df <- read.table(text = "ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000", header = TRUE)
Next, group by ID and take either the first or last value of variables in the group using head or tail, respectively.
library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(EQP_ID = tail(EQP_ID, 1),
            ENTRY = head(ENTRY, 1),
            EXIT = tail(EXIT, 1))
This gives:
# # A tibble: 2 x 4
# ID EQP_ID ENTRY EXIT
# <int> <int> <int> <int>
# 1 10 8123 58 59
# 2 11 2111 63 1000
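The head and tail calls work here because, in the sample data, the first row of each ID has an ENTRY and the last row has an EXIT. A sketch of a variant that does not rely on that, assuming DATE is in month/day/year format and that rows on the same date are already in the intended order (the `res` name is only for illustration):

```r
library(dplyr)

# Same sample data as in the question
df <- read.table(text = "ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000", header = TRUE)

res <- df %>%
  # Assumption: dates are month/day/year
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y")) %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  # Drop NA values before picking, so a group that starts with an
  # exit-only row or ends with an entry-only row is still handled
  summarise(EQP_ID = last(EQP_ID[!is.na(EXIT)]),
            ENTRY  = first(ENTRY[!is.na(ENTRY)]),
            EXIT   = last(EXIT[!is.na(EXIT)]))
print(res)
```

On the sample data this returns the same two rows as above; the difference only shows up when the rows of a group are not already ordered as entry rows followed by exit rows.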