Use dplyr to subset time series data from specified start and stop times
I have a time series of water quality observations taken every 15 minutes over several months. I would like to use a dplyr approach to excise certain time periods from this data, as specified by start/stop times in a separate table. This would be a vast improvement over manually deleting observations in the original spreadsheet.
I have attempted several approaches. The closest so far is to join the two tables, mutate an "excise" column that flags whether each original observation falls between the specified start/stop times, then filter out the flagged observations.
However, this approach does not excise the observations between my specified start/stop times. The initial left_join call creates additional rows for reasons I do not understand, and the observations I wish to excise remain present.
Is there an additional step needed in my pipeline, or some entirely different approach to perform this task?
# require packages
library(googlesheets)
library(tidyverse)
# import original data csv
hydrolab_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRjW5XkphY_Dpxv-GvzSAEFf5_21cP13na5K8L_ubl0yD6KwtkmknBI46WAK46YOXYiFYyaknb5WeGz/pub?gid=1104985471&single=true&output=csv")
# import time periods to be excised csv
excise_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRjW5XkphY_Dpxv-GvzSAEFf5_21cP13na5K8L_ubl0yD6KwtkmknBI46WAK46YOXYiFYyaknb5WeGz/pub?gid=0&single=true&output=csv")
reduced_dataset <- hydrolab_data %>%
  left_join(excise_data, by = c("SiteID", "Parameter")) %>%
  # remove observations based on specified start/stop times
  mutate(excise = case_when(DateTime > DateTime_Start &
                              DateTime < DateTime_End |
                              DateTime == DateTime_End |
                              DateTime == DateTime_Start ~ "Y")) %>%
  filter(is.na(excise))
# "hydrolab_data" is 34636 rows while "reduced_dataset" is 40615 rows. Why are extra rows being created?
Session Info:
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.3.1 anytime_0.3.9 lubridate_1.7.9 janitor_2.0.1
[5] hms_0.5.3 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
[9] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.3
[13] ggplot2_3.3.2 tidyverse_1.3.0 googlesheets_0.3.0
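On the row count: left_join returns one row per matching pair, so an observation whose SiteID and Parameter match several rows of the excise table is repeated once per match. The sketch below illustrates this on made-up toy data (the tibbles `obs` and `windows` and all of their values are hypothetical, not taken from the linked sheets). It also stores the timestamps as POSIXct before comparing them, on the assumption that the real columns are read in as text, in which case `<` and `>` would compare strings rather than times.

```r
library(dplyr)

# Hypothetical toy data: 4 observations and 2 excise windows for the same site/parameter
obs <- tibble(
  SiteID    = "FirstBridge",
  Parameter = "Temp",
  DateTime  = as.POSIXct(c("2020-07-29 08:00", "2020-07-29 08:15",
                           "2020-07-29 08:30", "2020-07-29 08:45"), tz = "UTC"),
  value     = c(21.41, 21.51, 21.62, 21.73)
)
windows <- tibble(
  SiteID         = "FirstBridge",
  Parameter      = "Temp",
  DateTime_Start = as.POSIXct(c("2020-07-29 08:10", "2020-07-29 08:40"), tz = "UTC"),
  DateTime_End   = as.POSIXct(c("2020-07-29 08:20", "2020-07-29 08:50"), tz = "UTC")
)

# Every observation matches both windows, so the join has 4 * 2 = 8 rows
joined <- left_join(obs, windows, by = c("SiteID", "Parameter"))

# Flag an observation if it falls inside ANY window, collapsing back to
# one row per observation, then drop the flagged ones
reduced <- joined %>%
  group_by(SiteID, Parameter, DateTime, value) %>%
  summarise(
    excise = any(DateTime >= DateTime_Start & DateTime <= DateTime_End, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!excise) %>%
  select(-excise)
```

Here `reduced` keeps only the 08:00 and 08:30 observations. dplyr 1.1.0 and later warn about a many-to-many relationship on a join like this one, which is the same duplication showing up as a warning.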
An alternative to dplyr is to use data.table, because it allows non-equi joins, which are very useful for querying date ranges:
library(data.table)
setDT(hydrolab_data)
setDT(excise_data)
# Convert to POSIXct
hydrolab_data[, DateTimeNum := as.POSIXct(DateTime, format = '%m/%d/%y %H:%M', tz = 'UTC')]
excise_data[, c("DateTime_StartNum", "DateTime_EndNum") :=
              .(as.POSIXct(DateTime_Start, tz = 'UTC'),
                as.POSIXct(DateTime_End, tz = 'UTC'))]
# Non-equi join: return the observations that fall inside an excise window
excise_data[hydrolab_data,
            .(SiteID, DateTime_Start, DateTime_End, Parameter, i.Parameter, DateTime, value),
            on = .(SiteID = SiteID, DateTime_StartNum <= DateTimeNum, DateTime_EndNum >= DateTimeNum),
            nomatch = 0]
SiteID DateTime_Start DateTime_End Parameter i.Parameter DateTime value
1: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC Temp 7/29/20 8:00 21.41
2: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC Temp 7/29/20 8:15 21.51
3: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC Temp 7/29/20 8:30 21.62
4: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC Temp 7/29/20 8:45 21.73
5: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC Temp 7/29/20 9:00 21.82
---
352: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC LDO. 7/30/20 5:00 105.90
353: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC LDO. 7/30/20 5:15 105.80
354: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC LDO. 7/30/20 5:30 105.70
355: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC LDO. 8/17/20 12:00 119.10
356: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00 TurbSC LDO. 8/17/20 12:15 118.90
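Note that the join above returns the observations that fall inside an excise window, and it matches on SiteID only (Parameter and i.Parameter differ in the output). To drop those observations instead, one option is to add Parameter to the join conditions and then filter the matched rows back out. A minimal sketch on hypothetical toy data (the tables `obs` and `windows`, their values, and the helper column `row_id` are made up for illustration):

```r
library(data.table)

# Hypothetical toy data: 3 timestamps for each of two parameters
obs <- data.table(
  SiteID    = "FirstBridge",
  Parameter = rep(c("Temp", "LDO."), each = 3),
  DateTime  = as.POSIXct(rep(c("2020-07-29 08:00", "2020-07-29 08:15",
                               "2020-07-29 08:30"), 2), tz = "UTC"),
  value     = c(21.41, 21.51, 21.62, 105.9, 105.8, 105.7)
)
# One excise window, applying to Temp only
windows <- data.table(
  SiteID         = "FirstBridge",
  Parameter      = "Temp",
  DateTime_Start = as.POSIXct("2020-07-29 08:10", tz = "UTC"),
  DateTime_End   = as.POSIXct("2020-07-29 08:20", tz = "UTC")
)

# Helper id so matched observations can be identified after the join
obs[, row_id := .I]

# Non-equi join: ids of observations inside a window for the same site and parameter
hits <- obs[windows,
            .(row_id),
            on = .(SiteID, Parameter,
                   DateTime >= DateTime_Start, DateTime <= DateTime_End),
            nomatch = NULL]

# Keep everything that was not matched, then drop the helper column
reduced <- obs[!(row_id %in% hits$row_id)]
reduced[, row_id := NULL]
```

In this toy case only the 08:15 Temp observation is removed; the LDO. rows are untouched because the window applies to Temp alone.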