
Collapse observation rows based on first and last occurrence in R

I have a dataset like this:

ID        EQP_ID         DATE           ENTRY     EXIT
10        1232           10/01/2018     0058      NA
10        8123           10/01/2018     NA        0059
11        8231           10/02/2018     0063      NA
11        233            10/03/2018     0064      NA
11        2512           10/04/2018     NA        0099
11        2111           10/05/2018     NA        1000

I want to collapse the observations so that, for a given ID, the earliest row with an ENTRY value is combined with the latest row with an EXIT value, and I also get the EQP_ID associated with the exit record:

ID       EQP_ID    ENTRY       EXIT
10       8123      0058        0059
11       2111      0063        1000

I'm fairly new to R, and this was complicated enough that I couldn't think of a good way to do it without resorting to a loop, whose performance is predictably not very good.

Edit

I think this does it, but I'd still be curious whether more experienced folks have a better answer:

group_by(dataset, ID) %>%
  arrange(ENTRY) %>%
  summarize(ENTRY = first(ENTRY), EXIT = last(EXIT), EQP_ID = last(EQP_ID))

One option with data.table:

library(data.table)

#create example data
dt <- data.table(
    id = c(10, 10, 11, 11, 11, 11),
    date = seq(as.Date("2018-10-1"), as.Date("2018-10-6"), by="day"),
    entry = c(58, NA, 63, 64, NA, NA),
    exit = c(NA, 59, NA, NA, 99, 100)
)

# number rows by id
dt[order(id, date), num := 1:.N, by=id]

# get first-entry and last-exit values by id
dt[ , keepentry := entry[1],by=id]
dt[ , keepexit  := exit[.N],by=id]

# keep one row per id
dt[num==1, .(id, keepentry, keepexit)]

Not my most elegant work, but it will get the job done.
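The data.table code above does not return EQP_ID, and it relies on the first row of each id holding an entry and the last row holding an exit. Here is a minimal sketch of a single grouped step that also picks up the exit record's equipment ID. It assumes every id has at least one non-NA entry and one non-NA exit; the eqp_id column and the res name are additions for illustration, with values copied from the question.

```r
library(data.table)

# Example data from the question, with eqp_id added (illustrative column name)
dt <- data.table(
  id     = c(10, 10, 11, 11, 11, 11),
  eqp_id = c(1232, 8123, 8231, 233, 2512, 2111),
  date   = as.Date(c("2018-10-01", "2018-10-01", "2018-10-02",
                     "2018-10-03", "2018-10-04", "2018-10-05")),
  entry  = c(58, NA, 63, 64, NA, NA),
  exit   = c(NA, 59, NA, NA, 99, 1000)
)

# Sort so that "first" and "last" within an id follow the dates
setorder(dt, id, date)

res <- dt[, {
  # Position of the last row in this id that has an exit value
  # (assumes at least one such row exists)
  i_exit <- tail(which(!is.na(exit)), 1)
  .(eqp_id = eqp_id[i_exit],
    entry  = entry[!is.na(entry)][1],  # first non-NA entry
    exit   = exit[i_exit])
}, by = id]

print(res)
```

Indexing by the position of the last non-NA exit keeps eqp_id and exit tied to the same row, which is what the question asks for.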

Using dplyr::first and dplyr::last we can do the following. Another option is to use min and max (with na.rm = TRUE, since both columns contain NA).

library(dplyr)
df %>% group_by(ID) %>% 
       summarise(EQP_ID=dplyr::last(EQP_ID), First=dplyr::first(ENTRY),Last=dplyr::last(EXIT))


 # A tibble: 2 x 4
 ID EQP_ID First  Last
 <int>  <int> <int> <int>
1    10   8123    58    59
2    11   2111    63  1000
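The min and max option mentioned in this answer could look roughly like the sketch below. It is only equivalent to first/last under the assumption that ENTRY and EXIT values increase over time within each ID, as they do in the sample data. The data frame is rebuilt here from the question's table (DATE omitted, leading zeros dropped), and res is just an illustrative name.

```r
library(dplyr)

# Question's data, rebuilt so the example is self-contained
df <- data.frame(
  ID     = c(10, 10, 11, 11, 11, 11),
  EQP_ID = c(1232, 8123, 8231, 233, 2512, 2111),
  ENTRY  = c(58, NA, 63, 64, NA, NA),
  EXIT   = c(NA, 59, NA, NA, 99, 1000)
)

res <- df %>%
  group_by(ID) %>%
  summarise(
    # which.max() skips NA, so this is the row holding the largest EXIT
    EQP_ID = EQP_ID[which.max(EXIT)],
    First  = min(ENTRY, na.rm = TRUE),
    Last   = max(EXIT, na.rm = TRUE)
  )

print(res)
```

Unlike first and last, this version does not depend on row order, but it would give the wrong rows if a later exit could carry a smaller value than an earlier one.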

This solution uses dplyr. First, define the data frame.

df <- read.table(text = "ID        EQP_ID         DATE           ENTRY     EXIT
10        1232           10/01/2018     0058      NA
10        8123           10/01/2018     NA        0059
11        8231           10/02/2018     0063      NA
11        233            10/03/2018     0064      NA
11        2512           10/04/2018     NA        0099
11        2111           10/05/2018     NA        1000", header = TRUE)

Next, group by ID and take the first or last value of each variable in the group using head or tail, respectively.

library(dplyr)

df %>%
  group_by(ID) %>% 
  summarise(EQP_ID = tail(EQP_ID, 1),
            ENTRY = head(ENTRY, 1),
            EXIT = tail(EXIT, 1))

This gives:

# # A tibble: 2 x 4
#       ID EQP_ID ENTRY  EXIT
#    <int>  <int> <int> <int>
# 1    10   8123    58    59
# 2    11   2111    63  1000
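Note that head and tail take whatever sits in the first and last row of each group, so they return NA if the rows are not already in date order. A hedged sketch of a more defensive variant is shown below: it sorts by date and then takes the first non-NA ENTRY and the last non-NA EXIT. The shuffled data frame is hypothetical (the question's rows in a different order), and it assumes every ID has at least one entry and one exit.

```r
library(dplyr)

# Hypothetical input: the question's rows, deliberately out of order
shuffled <- data.frame(
  ID     = c(10, 10, 11, 11, 11, 11),
  EQP_ID = c(8123, 1232, 2512, 8231, 2111, 233),
  DATE   = c("10/01/2018", "10/01/2018", "10/04/2018",
             "10/02/2018", "10/05/2018", "10/03/2018"),
  ENTRY  = c(NA, 58, NA, 63, NA, 64),
  EXIT   = c(59, NA, 99, NA, 1000, NA)
)

res <- shuffled %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y")) %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  summarise(
    # EQP_ID is computed first, while EXIT still refers to the full column
    EQP_ID = EQP_ID[tail(which(!is.na(EXIT)), 1)],
    ENTRY  = ENTRY[!is.na(ENTRY)][1],      # first non-NA entry
    EXIT   = tail(EXIT[!is.na(EXIT)], 1)   # last non-NA exit
  )

print(res)
```

The order of the arguments inside summarise matters here: once EXIT has been summarised, later expressions would see the one-value summary instead of the original column.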
