Remove rows based on date and time where no update actually occurs but keep the first instance

I've hit a wall trying to resolve this and hope somebody can help. I'm trying to filter the dataset below, which holds time-stamped bike station occupancy data.

   ID  Time                   Bike.Availability
1  2   01/04/2020  04:31:16   11
2  2   01/04/2020  04:40:07   11
3  2   01/04/2020  04:50:15   10
4  2   01/04/2020  04:57:10   10
5  2   01/04/2020  05:07:19    9
6  2   01/04/2020  05:19:38   10
7  2   01/04/2020  05:29:47   10
8  2   01/04/2020  06:43:54   11

I want to remove the rows where Bike.Availability has not changed from the previous row, keeping only the first instance of each run. I would like the resulting dataset to look as follows:

   ID  Time                   Bike.Availability
1  2   01/04/2020  04:31:16   11
2  2   01/04/2020  04:50:15   10
3  2   01/04/2020  05:07:19    9
4  2   01/04/2020  05:19:38   10
5  2   01/04/2020  06:43:54   11

I've converted the timestamp:

bike_data$Time <- as.POSIXct(bike_data$Time,format="%Y-%m-%d %H:%M:%S")
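
For reference, the sample rows are written as 01/04/2020 04:31:16, so the format string passed to as.POSIXct has to use slashes and put the year last for parsing to succeed. A minimal sketch, assuming day/month/year order and UTC (the sample does not disambiguate either, so both are assumptions to adjust):

```r
# Two of the sample timestamps, copied from the table above.
time_chr <- c("01/04/2020  04:31:16", "01/04/2020  04:40:07")

# ASSUMPTION: day/month/year order and the UTC time zone.
# Whitespace in a strptime-style format matches zero or more whitespace
# characters in the input, so the double space in the sample is fine.
time_parsed <- as.POSIXct(time_chr, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# TRUE here would mean at least one value failed to parse.
anyNA(time_parsed)
```

Checking anyNA() after the conversion is a cheap way to catch a format string that does not match the text.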

And I've tried different variations of:

library(dplyr)
bike_data %>%
 group_by(Time) %>%
 arrange(Bike.Availability) %>%
 top_n(1)

Any help or feedback would be greatly appreciated.

We group by 'ID' and by the run-length ID of 'Bike.Availability', i.e. a grouping index that increments whenever adjacent values of 'Bike.Availability' differ, then keep the first row of each group with slice_head(n = 1).

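library(dplyr)
library(data.table)  # provides rleid()
bike_data %>%
 # one group per station and per run of identical availability values
 group_by(ID, grp = rleid(Bike.Availability)) %>%
 # keep the first row of each run
 slice_head(n = 1) %>%
 ungroup() %>%
 # drop the helper column
 select(-grp)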

-output

# A tibble: 5 x 3
#     ID Time                 Bike.Availability
#  <int> <chr>                            <int>
#1     2 01/04/2020  04:31:16                11
#2     2 01/04/2020  04:50:15                10
#3     2 01/04/2020  05:07:19                 9
#4     2 01/04/2020  05:19:38                10
#5     2 01/04/2020  06:43:54                11
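
To make the grouping index concrete, here is a small sketch of what rleid() returns for the 'Bike.Availability' values in the example. The base R equivalent built on rle() is included only for illustration and is not part of the answer above; the variable names are made up for this sketch.

```r
library(data.table)

avail <- c(11L, 11L, 10L, 10L, 9L, 10L, 10L, 11L)

# rleid() starts at 1 and increments each time the value changes,
# so here it gives 1 1 2 2 3 4 4 5.
run_id <- rleid(avail)

# Illustrative base R equivalent: repeat each run's index by its length.
r <- rle(avail)
run_id_base <- rep(seq_along(r$lengths), r$lengths)

# Keeping the first element of each run gives 11 10 9 10 11.
first_of_run <- !duplicated(run_id)
avail[first_of_run]
```

Note that the second run of 10s gets its own index (4) rather than reusing 2, which is why grouping on the raw value alone would not give the desired result.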

Grouping by the 'Time' column creates groups with a single observation each (given the values shown in 'Time'), so top_n(1) returns the original dataset instead of a subset.
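
A quick way to confirm this is to count the rows per 'Time' group. The sketch below rebuilds the example data with data.frame() so that it runs on its own, and passes the weighting column to top_n() explicitly; the names group_sizes and attempt are illustrative.

```r
library(dplyr)

bike_data <- data.frame(
  ID = 2L,
  Time = c("01/04/2020  04:31:16", "01/04/2020  04:40:07",
           "01/04/2020  04:50:15", "01/04/2020  04:57:10",
           "01/04/2020  05:07:19", "01/04/2020  05:19:38",
           "01/04/2020  05:29:47", "01/04/2020  06:43:54"),
  Bike.Availability = c(11L, 11L, 10L, 10L, 9L, 10L, 10L, 11L)
)

# Every timestamp occurs exactly once, so every group has one row.
group_sizes <- bike_data %>% count(Time)
all(group_sizes$n == 1)

# Taking the top row of one-row groups cannot drop anything.
attempt <- bike_data %>%
  group_by(Time) %>%
  top_n(1, Bike.Availability) %>%
  ungroup()
nrow(attempt) == nrow(bike_data)
```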

data

bike_data <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
Time = c("01/04/2020  04:31:16", 
"01/04/2020  04:40:07", "01/04/2020  04:50:15", "01/04/2020  04:57:10", 
"01/04/2020  05:07:19", "01/04/2020  05:19:38", "01/04/2020  05:29:47", 
"01/04/2020  06:43:54"), Bike.Availability = c(11L, 11L, 10L, 
10L, 9L, 10L, 10L, 11L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))

A dplyr-only solution. Use ifelse to flag whether each row's Bike.Availability equals that of the row above (via lag), recode the NA produced for the first row to 0, then filter on the flag.

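library(dplyr)
bike_data %>% 
  # 1 if the value equals the previous row's value, 0 otherwise (NA for the first row)
  mutate(same = ifelse(Bike.Availability == lag(Bike.Availability), 1, 0)) %>% 
  # the first row has no predecessor, so treat it as a change
  mutate(same = ifelse(is.na(same), 0, same)) %>% 
  # keep only the rows where the value changed
  filter(same == 0) %>% 
  select(-same)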

Output:

  ID                 Time Bike.Availability
1  2 01/04/2020  04:31:16                11
3  2 01/04/2020  04:50:15                10
5  2 01/04/2020  05:07:19                 9
6  2 01/04/2020  05:19:38                10
8  2 01/04/2020  06:43:54                11
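
The same idea can be written more compactly by filtering on the comparison directly. The sketch below also groups by 'ID', on the assumption that the full dataset contains more than one station, so that the first row of one station is not compared with the last row of another; that assumption goes beyond the single-station sample shown here. The name result is illustrative.

```r
library(dplyr)

bike_data <- data.frame(
  ID = 2L,
  Time = c("01/04/2020  04:31:16", "01/04/2020  04:40:07",
           "01/04/2020  04:50:15", "01/04/2020  04:57:10",
           "01/04/2020  05:07:19", "01/04/2020  05:19:38",
           "01/04/2020  05:29:47", "01/04/2020  06:43:54"),
  Bike.Availability = c(11L, 11L, 10L, 10L, 9L, 10L, 10L, 11L)
)

result <- bike_data %>%
  group_by(ID) %>%  # ASSUMPTION: compare rows only within a station
  # keep the first row of each station, plus every row whose value
  # differs from the previous row's value
  filter(row_number() == 1 |
           Bike.Availability != lag(Bike.Availability)) %>%
  ungroup()

result
```

This relies on the rows already being in time order within each station, as they are in the sample; otherwise an arrange(ID, Time) on a parsed timestamp would be needed first.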
