简体   繁体   中英

Using dcast.data.table with date values and aggregation

Trying to figure this out. Suppose you have a data.table:

dt <- data.table (person=c('bob', 'bob', 'bob'), 
                  door=c('front door', 'front door', 'front door'),
                  type=c('timeIn', 'timeIn', 'timeOut'),
                  time=c(
as.POSIXct('2016 12 02 06 05 01', format = '%Y %m %d %H %M %S'),
as.POSIXct('2016 12 02 06 05 02', format = '%Y %m %d %H %M %S'),
as.POSIXct('2016 12 02 06 05 03', format = '%Y %m %d %H %M %S')                     )
)

I want to pivot it to look like this

person        door        timeIn             timeOut

bob           front door  min(<date/time>) max(<date/time>)

I can't seem to get the right syntax for dcast.data.table. I tried

dcast.data.table(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

which throws an error:

Aggregating function(s) should take vector inputs and return a single value (length=1).

I also tried:

 dcast.data.table(dt, person + door ~ type, value.var = 'time')

But the result throws away my dates

   person       door timeIn timeOut
1:    bob front door      2       1

Any suggestions would be appreciated. TIA

There are several ways to achieve the desired result using dcast . jazzurro 's solution does the aggregation before reshaping the result. The approaches here use dcast directly but may require some post-processing. We are using jazzurro 's data which are tweaked to obey the UTC time zone and CRAN version 1.10.0 of data.table .

1. Getting ifelse to work

As reported in the Q,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

returns an error message. The full text of the error message includes the hint to use the fill parameter. Unfortunately, ifelse() doesn't respect the POSIXct class (for details see ?ifelse ) so this needs to be enforced.

With

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) 
    lubridate::as_datetime(ifelse(type == 'timeIn', min(x), max(x))),
  fill = 0
)

we do get

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

2. Alternative to ifelse

ifelse 's help page suggests

(tmp <- yes; tmp[!test] <- no[!test]; tmp)

as alternative. Following this advice,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) {
    test <- type == "timeIn"; tmp <- min(x); tmp[!test] = max(x)[!test]; tmp
    }
)

returns

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Note that neither the fill parameter nor the coercion to POSIXct is needed.

3. Using enhanced dcast

With the latest versions of dcast.data.table we can provide a list of functions to fun.aggregate :

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))

returns

#   person       door     time_min_timeIn    time_min_timeOut     time_max_timeIn    time_max_timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:03 2016-12-02 07:06:02 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:03 2016-12-02 06:05:02 2016-12-02 06:05:05

We can remove the unwanted columns and rename the others by

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))[
  , .(person, door, timeIn = time_min_timeIn, timeOut = time_max_timeOut)]

which gets us

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Data

As mentioned above, we are using jazzurro 's data

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

but coerce the time zone to UTC .

With

dt[, time := lubridate::with_tz(time, "UTC")]

we have

dt
#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

independent of local time zone.

This would be one way to achieve your goal. I modified your dt and created the following data set. For each person, I looked for the minimum time of timeIn and the maximum time of timeOut . Then, I applied dcast() to the result.

#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

library(data.table)

dcast(
   dt[, .SD[(type == "timeIn" & time == min(time))|(type == "timeOut" & time == max(time))], by = person],
   person + door ~ type)

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

DATA

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM