
R data.table add column as function of another data.table

I have one data table that contains just a sequence of times, and another data table with two columns: start_times and end_times. I want to add a column to the first data table whose value is the number of rows in the second data table for which the time from the first table falls between the start and end time. Here is my code:

start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")

all_dates <- as.data.table(seq(start_date, end_date, "min"))

colnames(all_dates) <- c("Bin")

start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)

Here is what I want to do, but it is wrong and gives an error. What's the right way to write this?

all_dates[, BinCount := input_data[start_times < Bin & end_times > Bin, .N] ]

In the end I should get something like:

Bin                   BinCount
2017-01-31 17:00:00   1
2017-01-31 17:01:00   5
...

This can be solved easily with sqldf, which makes it straightforward to join tables on a range condition. One solution:

The data from OP:
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")

all_dates <- as.data.table(seq(start_date, end_date, "min"))

colnames(all_dates) <- c("Bin")

start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)


library(sqldf)

result <- sqldf("SELECT all_dates.Bin, COUNT(*) AS BinCount
                 FROM all_dates, input_data
                 WHERE all_dates.Bin > input_data.start_times
                   AND all_dates.Bin < input_data.end_times
                 GROUP BY all_dates.Bin")

result
                    Bin BinCount
1   2017-01-31 17:01:00        1
2   2017-01-31 17:02:00        1
3   2017-01-31 17:03:00        1
4   2017-01-31 17:04:00        1
5   2017-01-31 17:05:00        1
6   2017-01-31 17:06:00        1
...........
...........
497 2017-02-01 01:17:00        6
498 2017-02-01 01:18:00        5
499 2017-02-01 01:19:00        5
500 2017-02-01 01:20:00        4
 [ reached getOption("max.print") -- omitted 460 rows ]
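The query above is an inner join, so a bin that falls inside no interval does not appear in the result at all. If zero counts are wanted, one possible variation is a LEFT JOIN that counts only the matched rows. This is a minimal sketch on a tiny hand-made data set (the values and the UTC time zone are assumptions chosen so the counts can be checked by eye), not part of the original answer:

```r
library(sqldf)

# Hypothetical toy data: five one-minute bins and two intervals
t0 <- as.POSIXct("2017-01-31 17:00:00", tz = "UTC")
all_dates  <- data.frame(Bin = t0 + 60 * 0:4)        # 17:00 ... 17:04
input_data <- data.frame(
  start_times = t0 + 60 * c(0, 1),                   # 17:00, 17:01
  end_times   = t0 + 60 * c(3, 4)                    # 17:03, 17:04
)

# LEFT JOIN keeps every bin; COUNT(column) ignores the NULLs
# produced for bins that match no interval, so those bins get 0
result <- sqldf("SELECT all_dates.Bin, COUNT(input_data.start_times) AS BinCount
                 FROM all_dates
                 LEFT JOIN input_data
                   ON all_dates.Bin > input_data.start_times
                  AND all_dates.Bin < input_data.end_times
                 GROUP BY all_dates.Bin
                 ORDER BY all_dates.Bin")
result
```

With strict inequalities, the first and last bins here match neither interval and should come back with a count of 0 rather than being dropped.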

In data.table, what you're after is a range (non-equi) join.

library(data.table)

start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")

all_dates <- as.data.table(seq(start_date, end_date, "min"))

colnames(all_dates) <- c("Bin")

set.seed(123)
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)

## doing the range-join and calculating the number of items per bin in one chained step
input_data[
    all_dates
    , on = .(start_times < Bin, end_times > Bin)
    , nomatch = 0
    , allow.cartesian = TRUE
][, .N, by = start_times]

#             start_times N
# 1:  2017-01-31 17:01:00 1
# 2:  2017-01-31 17:02:00 1
# 3:  2017-01-31 17:03:00 1
# 4:  2017-01-31 17:04:00 1
# 5:  2017-01-31 17:05:00 1
# ---                      
# 956: 2017-02-01 08:56:00 6
# 957: 2017-02-01 08:57:00 4
# 958: 2017-02-01 08:58:00 4
# 959: 2017-02-01 08:59:00 5
# 960: 2017-02-01 09:00:00 5

Note:

  • I've put the all_dates object on the right-hand side of the join, so the result carries the input_data column names: the column called start_times actually holds your Bin values (see this issue for the discussion on this topic)
  • I've used set.seed() so that the sampled data, and therefore the output, is reproducible
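The question asks for the count as a new column on all_dates, including bins that match nothing. One way this might be done is to keep the unmatched rows and count per row of all_dates with by = .EACHI. The sketch below uses a tiny hand-made data set (hypothetical values, UTC assumed) and relies on .N being 0 for unmatched rows under by = .EACHI, which is how current data.table versions behave as far as I know:

```r
library(data.table)

# Hypothetical toy data: five one-minute bins and two intervals
t0 <- as.POSIXct("2017-01-31 17:00:00", tz = "UTC")
all_dates  <- data.table(Bin = t0 + 60 * 0:4)        # 17:00 ... 17:04
input_data <- data.table(
  start_times = t0 + 60 * c(0, 1),                   # 17:00, 17:01
  end_times   = t0 + 60 * c(3, 4)                    # 17:03, 17:04
)

# One group per row of all_dates (by = .EACHI), in all_dates order;
# without nomatch = 0 the unmatched bins are kept
counts <- input_data[
    all_dates
    , on = .(start_times < Bin, end_times > Bin)
    , .N
    , by = .EACHI
]

# Row order matches all_dates, so the counts can be assigned by reference
all_dates[, BinCount := counts$N]
all_dates
```

Because the join result has exactly one row per Bin, no separate merge back onto all_dates is needed.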

This wasn't requested, but here is a compact alternative using the tidyverse. It uses the lubridate parsers, interval(), and %within%, together with purrr::map_int(), to generate the desired bin counts.

library(tidyverse)
library(lubridate)
start_date <- ymd_hms(x = "2017-01-31 17:00:00") # lubridate parsers
end_date <- ymd_hms(x = "2017-02-01 09:00:00")

all_dates <- tibble(seq(start_date, end_date, "min")) # tibble swap for data.table

colnames(all_dates) <- c("Bin")

start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- tibble(
  start_times,
  end_times,
  intvl = interval(start_times, end_times) # Add interval column
  )

all_dates %>% # Checks date in Bin and counts intervals it lies within
  mutate(BinCount = map_int(.$Bin, ~ sum(. %within% input_data$intvl)))
# A tibble: 961 x 2
   Bin                 BinCount
   <dttm>                 <int>
 1 2017-01-31 17:00:00        0
 2 2017-01-31 17:01:00        0
 3 2017-01-31 17:02:00        0
 4 2017-01-31 17:03:00        0
 5 2017-01-31 17:04:00        0
 6 2017-01-31 17:05:00        0
 7 2017-01-31 17:06:00        0
 8 2017-01-31 17:07:00        1
 9 2017-01-31 17:08:00        1
10 2017-01-31 17:09:00        1
# ... with 951 more rows
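One difference from the other answers: %within% treats both endpoints as inside the interval, whereas the data.table and sqldf versions use strict < and >. If the strict behaviour is wanted, the comparison can be written out directly. A minimal sketch on hand-made data (hypothetical values, UTC assumed), not the original author's code:

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: five one-minute bins and two intervals
t0 <- as.POSIXct("2017-01-31 17:00:00", tz = "UTC")
all_dates  <- tibble(Bin = t0 + 60 * 0:4)            # 17:00 ... 17:04
input_data <- tibble(
  start_times = t0 + 60 * c(0, 1),                   # 17:00, 17:01
  end_times   = t0 + 60 * c(3, 4)                    # 17:03, 17:04
)

# For each bin, count the intervals that strictly contain it
strict_counts <- all_dates %>%
  mutate(BinCount = map_int(
    Bin,
    ~ sum(input_data$start_times < .x & input_data$end_times > .x)
  ))
strict_counts
```

With %within%, a bin that coincides exactly with a start or end time would be counted as well, so the two approaches can differ on boundary minutes.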
