Use dplyr to subset time series data from specified start and stop times

I have a time series of water quality observations taken every 15 minutes over several months. I would like to use a dplyr approach to excise certain time periods from this data, as specified by start/stop times in a separate table. This would be a vast improvement over manually deleting observations in the original spreadsheet.

I have attempted several approaches. The closest so far is to join the two tables, mutate an "excise" column that flags whether each original observation falls between the specified start/stop times, then filter out the flagged observations.

However, this approach does not excise the observations between my specified start/stop times. The initial left_join function creates additional rows for reasons I do not understand, and the observations I wish to excise remain present.

Is there an additional step needed in my pipeline, or some other entirely different approach to perform this task?


# require packages
library(googlesheets)
library(tidyverse)

# import original data csv
hydrolab_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRjW5XkphY_Dpxv-GvzSAEFf5_21cP13na5K8L_ubl0yD6KwtkmknBI46WAK46YOXYiFYyaknb5WeGz/pub?gid=1104985471&single=true&output=csv")

# import time periods to be excised csv
excise_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRjW5XkphY_Dpxv-GvzSAEFf5_21cP13na5K8L_ubl0yD6KwtkmknBI46WAK46YOXYiFYyaknb5WeGz/pub?gid=0&single=true&output=csv")

reduced_dataset <- hydrolab_data %>%
  left_join(excise_data, by = c("SiteID","Parameter")) %>%
  # remove observations based on specified start/stop times
  mutate(excise = case_when(DateTime > DateTime_Start &
                              DateTime < DateTime_End |
                              DateTime == DateTime_End |
                              DateTime == DateTime_Start ~ "Y")) %>%
  filter(is.na(excise))

# "hydrolab_data" is 34636 rows while "reduced_dataset" is 40615 rows.  Why are extra rows being created?


Session Info:
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.3.1       anytime_0.3.9      lubridate_1.7.9    janitor_2.0.1     
 [5] hms_0.5.3          forcats_0.5.0      stringr_1.4.0      dplyr_1.0.2       
 [9] purrr_0.3.4        readr_1.4.0        tidyr_1.1.2        tibble_3.0.3      
[13] ggplot2_3.3.2      tidyverse_1.3.0    googlesheets_0.3.0
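
On the extra rows: left_join() returns one output row for every matching pair of rows, so an observation whose SiteID/Parameter key appears in more than one row of excise_data is repeated once per excise period. The flag is then computed per copy, and any copy paired with a period that does not cover it passes the filter, which is how the result can end up larger than the input. The toy example below reproduces this and sketches one possible dplyr workaround: collect the keys of observations that fall inside any period, then anti_join() them out of the original table. The data, and the names `obs`, `periods`, `kept`, `to_excise` and `reduced`, are made up for illustration and are not the asker's sheets. It also assumes the date-time columns have already been converted to POSIXct; read.csv() returns them as character strings.

```r
library(dplyr)

# Hypothetical toy data: four 15-minute observations for one site/parameter
obs <- tibble(
  SiteID    = "A",
  Parameter = "Temp",
  DateTime  = as.POSIXct("2020-07-29 08:00", tz = "UTC") + 900 * 0:3,
  value     = c(21.4, 21.5, 21.6, 21.7)
)

# Two excise periods for the same key: this is what multiplies the rows
periods <- tibble(
  SiteID         = "A",
  Parameter      = "Temp",
  DateTime_Start = as.POSIXct(c("2020-07-29 08:00", "2020-07-29 08:45"), tz = "UTC"),
  DateTime_End   = as.POSIXct(c("2020-07-29 08:15", "2020-07-29 08:45"), tz = "UTC")
)

# Newer dplyr versions warn about a many-to-many relationship here
joined <- left_join(obs, periods, by = c("SiteID", "Parameter"))
nrow(joined)  # 8: each of the 4 observations appears once per period

# Filtering the joined table keeps every copy paired with a non-covering period
kept <- joined %>%
  filter(!(DateTime >= DateTime_Start & DateTime <= DateTime_End))
nrow(kept)  # 5, more than the 4 original observations

# Instead: keys of observations that fall inside at least one period
to_excise <- joined %>%
  filter(DateTime >= DateTime_Start, DateTime <= DateTime_End) %>%
  distinct(SiteID, Parameter, DateTime)

# Drop them from the original table, so the row count can only shrink
reduced <- anti_join(obs, to_excise, by = c("SiteID", "Parameter", "DateTime"))
reduced  # only the 08:30 observation remains
```

Because the anti_join() runs against the original table rather than the joined one, duplicated copies never reach the result.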

An alternative to dplyr is to use data.table, because it allows non-equi joins, which are very useful for querying date ranges:

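library(data.table)
setDT(hydrolab_data)
setDT(excise_data)

# Convert the character date-times to POSIXct so they can be compared
hydrolab_data[, DateTimeNum := as.POSIXct(DateTime, format = '%m/%d/%y %H:%M', tz = 'UTC')]
excise_data[, c("DateTime_StartNum", "DateTime_EndNum") := .(as.POSIXct(DateTime_Start, tz = 'UTC'),
                                                             as.POSIXct(DateTime_End, tz = 'UTC'))]

# Non-equi join: for each observation, look up the excise periods at the same
# site that contain it. nomatch = 0 drops observations outside every period,
# so the result lists the observations that fall inside an excise period.
excise_data[hydrolab_data, .(SiteID,
                             DateTime_Start,
                             DateTime_End,
                             Parameter,
                             i.Parameter,
                             DateTime,
                             value
                            ),
            on = .(SiteID = SiteID, DateTime_StartNum <= DateTimeNum, DateTime_EndNum >= DateTimeNum), nomatch = 0]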

         SiteID     DateTime_Start        DateTime_End Parameter i.Parameter      DateTime  value
  1: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        Temp  7/29/20 8:00  21.41
  2: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        Temp  7/29/20 8:15  21.51
  3: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        Temp  7/29/20 8:30  21.62
  4: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        Temp  7/29/20 8:45  21.73
  5: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        Temp  7/29/20 9:00  21.82
 ---                                                                                              
352: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        LDO.  7/30/20 5:00 105.90
353: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        LDO.  7/30/20 5:15 105.80
354: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        LDO.  7/30/20 5:30 105.70
355: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        LDO. 8/17/20 12:00 119.10
356: FirstBridge 2020-07-29 0:08:00 2020-08-17 12:15:00    TurbSC        LDO. 8/17/20 12:15 118.90
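
The join above lists the observations that fall inside an excise period. To get the dataset with those observations removed, one option is data.table's not-join (`x[!i, on = ...]`) with the same kind of range conditions. The sketch below uses made-up toy tables (`obs` and `periods` are hypothetical stand-ins for hydrolab_data and excise_data) and assumes the excise periods should match on both SiteID and Parameter, as in the question's left_join; drop Parameter from `on` if a period should apply to every parameter at a site.

```r
library(data.table)

# Hypothetical toy data standing in for hydrolab_data and excise_data
obs <- data.table(
  SiteID      = "A",
  Parameter   = "Temp",
  DateTimeNum = as.POSIXct("2020-07-29 08:00", tz = "UTC") + 900 * 0:3,
  value       = c(21.4, 21.5, 21.6, 21.7)
)
periods <- data.table(
  SiteID            = "A",
  Parameter         = "Temp",
  DateTime_StartNum = as.POSIXct(c("2020-07-29 08:00", "2020-07-29 08:45"), tz = "UTC"),
  DateTime_EndNum   = as.POSIXct(c("2020-07-29 08:15", "2020-07-29 08:45"), tz = "UTC")
)

# Not-join: keep the rows of obs that match no row of periods.
# In `on`, the left-hand names are columns of obs, the right-hand ones of periods.
reduced <- obs[!periods,
               on = .(SiteID, Parameter,
                      DateTimeNum >= DateTime_StartNum,
                      DateTimeNum <= DateTime_EndNum)]
reduced  # only the 08:30 observation remains
```

Since the result is a subset of the observation table, no rows are duplicated even when several periods share a key.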
