Using cut function on data.table for dates

I'm trying to figure out the best way to achieve my goal. I have one large master data.table (>110,000 observations) that contains data from photos taken by several camera stations. I have a separate data.table that holds information about these cameras, such as when photos were uploaded from them (photos were uploaded from each camera multiple times). For each camera, I need to assign the photos it took to "bins" that are defined by when the photos were uploaded. I assume I will need a for loop to run through each camera, but from there I get stuck. I feel like I am looking for a version of the cut function that returns the "bin" each photo belongs to as a separate column in the data.table.

Example data.table of the photos from one camera:

> station_photos
   year_unit_station    Photo_Number   Creation_Datetime bin_name
1:       2016_275_02 275_02_0017.JPG 2016-09-23 11:51:03
2:       2016_275_02 275_02_0035.JPG 2016-09-27 15:58:21
3:       2016_275_02 275_02_0036.JPG 2016-09-27 15:58:49
4:       2016_275_02 275_02_0037.JPG 2016-09-27 16:00:04
5:       2016_275_02 275_02_0038.JPG 2016-09-27 16:00:59
6:       2016_275_02 275_02_0039.JPG 2016-09-27 16:01:27
7:       2016_275_02 275_02_0062.JPG 2016-10-02 12:22:35

Example of the table that shows when the photos were uploaded:

> station_bins
   year_unit_station    service_end_dttm      bin_name
1:       2016_275_02 2016-09-23 11:21:00 2016_275_02_1
2:       2016_275_02 2016-09-30 10:45:00 2016_275_02_2
3:       2016_275_02 2016-10-07 08:31:00 2016_275_02_3

End goal table for each camera that I am hoping to get from my code:

> station_photos
   year_unit_station    Photo_Number   Creation_Datetime      bin_name
1:       2016_275_02 275_02_0017.JPG 2016-09-23 11:51:03 2016_275_02_1
2:       2016_275_02 275_02_0035.JPG 2016-09-27 15:58:21 2016_275_02_1
3:       2016_275_02 275_02_0036.JPG 2016-09-27 15:58:49 2016_275_02_1
4:       2016_275_02 275_02_0037.JPG 2016-09-27 16:00:04 2016_275_02_1
5:       2016_275_02 275_02_0038.JPG 2016-09-27 16:00:59 2016_275_02_1
6:       2016_275_02 275_02_0039.JPG 2016-09-27 16:01:27 2016_275_02_1
7:       2016_275_02 275_02_0062.JPG 2016-10-02 12:22:35 2016_275_02_2
8:       2016_275_02 275_02_0075.JPG 2016-10-31 03:09:43 2016_275_02_3

I've considered using cut() or subset, but I am not sure how to get either to fill in that last variable, "bin_name", rather than just returning a list or data.frame. My other concern is that not every camera will have 3 bins; some will have 2 and some will have 4. And to add one more twist: how could I use a similar or the same method to create bins of a set length rather than from a date range? The end goal is to count how many photos each camera took between uploads, as well as to count the number of photos taken by each camera in 10-minute intervals. It would be very helpful to still have that bin_name column for future analysis.

I'm not really sure if my explanation makes sense, and it is quite possible that I am making the solution way more complicated than it needs to be. Thank you in advance for any help or insight you can give!

Perhaps I have misunderstood the whole question, but I am wondering why the timestamp of the upload is before the timestamps of the photos in the bin. E.g., for bin 2016_275_02_1 the service_end_dttm is 2016-09-23 11:21:00, but the first photo in that bin was taken half an hour later, at Creation_Datetime 2016-09-23 11:51:03.

However, OP's expected result can be reproduced by a rolling join combined with an update by reference:

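library(data.table)
# For each photo, roll = Inf picks the latest service_end_dttm at or before
# Creation_Datetime within the same station; bin_name is then added by reference.
station_photos[, bin_name := 
                 station_bins[station_photos, 
                              on = c("year_unit_station", "service_end_dttm" = "Creation_Datetime"), 
                              roll = Inf, bin_name]][]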
   year_unit_station    Photo_Number   Creation_Datetime      bin_name
1:       2016_275_02 275_02_0017.JPG 2016-09-23 11:51:03 2016_275_02_1
2:       2016_275_02 275_02_0035.JPG 2016-09-27 15:58:21 2016_275_02_1
3:       2016_275_02 275_02_0036.JPG 2016-09-27 15:58:49 2016_275_02_1
4:       2016_275_02 275_02_0037.JPG 2016-09-27 16:00:04 2016_275_02_1
5:       2016_275_02 275_02_0038.JPG 2016-09-27 16:00:59 2016_275_02_1
6:       2016_275_02 275_02_0039.JPG 2016-09-27 16:01:27 2016_275_02_1
7:       2016_275_02 275_02_0062.JPG 2016-10-02 12:22:35 2016_275_02_2
8:       2016_275_02 275_02_0075.JPG 2016-10-31 03:09:43 2016_275_02_3
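
The question also asks for photo counts per upload bin and per fixed 10-minute interval. The following is only a sketch of one way to get both once bin_name is filled in. It rebuilds the eight labelled photos by hand, and tz = "UTC", interval_start, photos_per_bin and photos_per_10min are assumptions or names introduced here, not part of the original data.

```r
library(data.table)

# Toy data: the eight photos from the question, already labelled with bin_name.
# tz = "UTC" is an assumption; use the time zone the cameras were set to.
station_photos <- data.table(
  year_unit_station = "2016_275_02",
  Creation_Datetime = as.POSIXct(
    c("2016-09-23 11:51:03", "2016-09-27 15:58:21", "2016-09-27 15:58:49",
      "2016-09-27 16:00:04", "2016-09-27 16:00:59", "2016-09-27 16:01:27",
      "2016-10-02 12:22:35", "2016-10-31 03:09:43"),
    tz = "UTC"),
  bin_name = paste0("2016_275_02_", c(1, 1, 1, 1, 1, 1, 2, 3))
)

# Number of photos per camera and upload bin.
photos_per_bin <- station_photos[, .N, by = .(year_unit_station, bin_name)]

# Fixed-length bins: floor each timestamp to a multiple of 600 seconds
# since the epoch, i.e. to the start of its 10-minute clock interval.
station_photos[, interval_start := as.POSIXct(
  as.numeric(Creation_Datetime) %/% 600 * 600,
  origin = "1970-01-01", tz = "UTC")]

# Number of photos per camera and 10-minute interval.
photos_per_10min <- station_photos[, .N, by = .(year_unit_station, interval_start)]

print(photos_per_bin)
print(photos_per_10min)
```

Because both counts group by year_unit_station, the same code should carry over to the full master table without a for loop over cameras, and it does not depend on how many bins each camera has.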

Data

It was a bit tricky to convert the printed sample data into data.table objects to work with:

library(data.table)
library(magrittr)

station_photos <- "
   year_unit_station    Photo_Number   Creation_Datetime bin_name
1:       2016_275_02 275_02_0017.JPG 2016-09-23 11:51:03
2:       2016_275_02 275_02_0035.JPG 2016-09-27 15:58:21
3:       2016_275_02 275_02_0036.JPG 2016-09-27 15:58:49
4:       2016_275_02 275_02_0037.JPG 2016-09-27 16:00:04
5:       2016_275_02 275_02_0038.JPG 2016-09-27 16:00:59
6:       2016_275_02 275_02_0039.JPG 2016-09-27 16:01:27
7:       2016_275_02 275_02_0062.JPG 2016-10-02 12:22:35
8:       2016_275_02 275_02_0075.JPG 2016-10-31 03:09:43" %>% 
  readr::read_fwf(col_types = "-ccc") %>% 
  setDT() %>% 
  setnames(.[1, unlist(.SD)]) %>% 
  .[-1] %>% 
  .[, Creation_Datetime := anytime::anytime(Creation_Datetime)]

station_bins <- "
   year_unit_station    service_end_dttm      bin_name
1:       2016_275_02 2016-09-23 11:21:00 2016_275_02_1
2:       2016_275_02 2016-09-30 10:45:00 2016_275_02_2
3:       2016_275_02 2016-10-07 08:31:00 2016_275_02_3" %>% 
  readr::read_fwf(col_types = "-ccc") %>% 
  setDT() %>% 
  setnames(.[1, unlist(.SD)]) %>% 
  .[-1] %>% 
  .[, service_end_dttm := anytime::anytime(service_end_dttm)]
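
If re-parsing the printed output is not a requirement, the same two objects can probably be built more simply by typing the values into data.table() directly. This is a sketch rather than the approach used above; tz = "UTC" is an assumption made for reproducibility (anytime() uses the local time zone), and the rolling join is repeated at the end only to check that the hand-built data gives the expected bins.

```r
library(data.table)

# Photos from one camera; tz = "UTC" is an assumption.
station_photos <- data.table(
  year_unit_station = "2016_275_02",
  Photo_Number = sprintf("275_02_%04d.JPG",
                         c(17L, 35L, 36L, 37L, 38L, 39L, 62L, 75L)),
  Creation_Datetime = as.POSIXct(
    c("2016-09-23 11:51:03", "2016-09-27 15:58:21", "2016-09-27 15:58:49",
      "2016-09-27 16:00:04", "2016-09-27 16:00:59", "2016-09-27 16:01:27",
      "2016-10-02 12:22:35", "2016-10-31 03:09:43"),
    tz = "UTC")
)

# Upload times that define the bins for the same camera.
station_bins <- data.table(
  year_unit_station = "2016_275_02",
  service_end_dttm = as.POSIXct(
    c("2016-09-23 11:21:00", "2016-09-30 10:45:00", "2016-10-07 08:31:00"),
    tz = "UTC"),
  bin_name = paste0("2016_275_02_", 1:3)
)

# Same rolling join as in the answer, applied to the hand-built tables.
station_photos[, bin_name :=
                 station_bins[station_photos,
                              on = c("year_unit_station",
                                     "service_end_dttm" = "Creation_Datetime"),
                              roll = Inf, bin_name]]

print(station_photos)
```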
