
Getting cumulative count with plyr in R

I have a data frame with around 70,000 rows and I am trying to get a count that depends on date-time variables. I have been using plyr for my other analyses, but this one is just not working. My data frame is as below:

    Create.Date.Time    Service         Closing.Date.Time
1   2013-06-01 12:59:00 AV              2013-06-01 13:59:00
2   2013-06-02 07:56:00 SERVICE684793   2013-06-02 08:59:00
3   2013-06-02 09:39:00 SERVICE684793   2013-06-03 12:01:00
4   2013-06-02 14:14:00 SERVICE684796   2013-06-02 14:55:00
5   2013-06-02 17:20:00 SERVICE684797   2013-06-03 12:06:00
6   2013-06-03 07:20:00 SERVICE684793   2013-06-03 07:39:00
7   2013-06-03 08:02:00 SERVICE684839   2013-06-03 12:09:00
8   2013-06-03 08:04:00 SERVICE684841   2013-06-04 08:05:00
9   2013-06-03 08:04:00 SERVICE684841   2013-06-05 08:06:00
10  2013-06-03 08:08:00 SERVICE684841   2013-06-03 08:08:00

My aim is to obtain, for each Service, the number of observations which have been closed by each Create.Date.Time. I do not want to use for loops since that will take forever. I wanted to use plyr, with the function being a count:

count number of observations where

Closing.Date.Time <= Create.Date.Time

for each Create.Date.Time for each Service.

My starting point is ddply(df, .(Service, Create.Date.Time), ...), but I am having trouble with my function since the values depend on my Create.Date.Time and I do not know how to write that. Could someone help me please?

I want to end up with a data frame like this:

 Service        Create.Date.Time      Num.Closed
  AV            2013-06-01 12:59:00      0
  SERVICE684793 2013-06-02 07:56:00      0
  SERVICE684793 2013-06-02 09:39:00      1
  SERVICE684793 2013-06-03 07:20:00      1
  SERVICE684796 2013-06-02 14:14:00      0
  SERVICE684797 2013-06-02 17:20:00      0
  SERVICE684839 2013-06-03 08:02:00      0
  SERVICE684841 2013-06-03 08:04:00      0
  SERVICE684841 2013-06-03 08:04:00      0
  SERVICE684841 2013-06-03 08:08:00      3

I'm not really sure how the data.frame you want to end up with relates to the question you asked, since the results aren't the ones you describe. Could you perhaps write the loop that you would use if there were no other alternative?

If you want (as you wrote) the:

count number of observations where

Closing.Date.Time <= Create.Date.Time

for each Create.Date.Time for each Service, then a good way to go would be to use the data.table package. In that case, your data is:

       Create.Date.Time       Service   Closing.Date.Time
 1: 2013-06-01 12:59:00            AV 2013-06-01 13:59:00
 2: 2013-06-02 07:56:00 SERVICE684793 2013-06-02 08:59:00
 3: 2013-06-02 09:39:00 SERVICE684793 2013-06-03 12:01:00
 4: 2013-06-02 14:14:00 SERVICE684796 2013-06-02 14:55:00
 5: 2013-06-02 17:20:00 SERVICE684797 2013-06-03 12:06:00
 6: 2013-06-03 07:20:00 SERVICE684793 2013-06-03 07:39:00
 7: 2013-06-03 08:02:00 SERVICE684839 2013-06-03 12:09:00
 8: 2013-06-03 08:04:00 SERVICE684841 2013-06-04 08:05:00
 9: 2013-06-03 08:04:00 SERVICE684841 2013-06-05 08:06:00
10: 2013-06-03 08:08:00 SERVICE684841 2013-06-03 08:08:00

where the dates and times are in POSIXct format.

Then:

dt[, sum(Closing.Date.Time <= Create.Date.Time), by = c('Service', 'Create.Date.Time')]

would result in

         Service    Create.Date.Time V1
1:            AV 2013-06-01 12:59:00  0
2: SERVICE684793 2013-06-02 07:56:00  0
3: SERVICE684793 2013-06-02 09:39:00  0
4: SERVICE684796 2013-06-02 14:14:00  0
5: SERVICE684797 2013-06-02 17:20:00  0
6: SERVICE684793 2013-06-03 07:20:00  0
7: SERVICE684839 2013-06-03 08:02:00  0
8: SERVICE684841 2013-06-03 08:04:00  0
9: SERVICE684841 2013-06-03 08:08:00  1

Which is what you described.

Cheers.
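
The grouping above compares each row only with rows that share both its Service and its Create.Date.Time. If the intent is instead to count, for every row, how many rows of the same Service were closed on or before that row's creation time, a minimal base R sketch could look like the one below. It compares full date-times, reads the printed times as UTC, and uses a helper named count_closed; the timezone and the helper name are assumptions, not part of the original post.

```r
# Sample data from the question; the UTC timezone is an assumption.
df <- data.frame(
  Create.Date.Time = as.POSIXct(c(
    "2013-06-01 12:59:00", "2013-06-02 07:56:00", "2013-06-02 09:39:00",
    "2013-06-02 14:14:00", "2013-06-02 17:20:00", "2013-06-03 07:20:00",
    "2013-06-03 08:02:00", "2013-06-03 08:04:00", "2013-06-03 08:04:00",
    "2013-06-03 08:08:00"), tz = "UTC"),
  Service = c("AV", "SERVICE684793", "SERVICE684793", "SERVICE684796",
              "SERVICE684797", "SERVICE684793", "SERVICE684839",
              "SERVICE684841", "SERVICE684841", "SERVICE684841"),
  Closing.Date.Time = as.POSIXct(c(
    "2013-06-01 13:59:00", "2013-06-02 08:59:00", "2013-06-03 12:01:00",
    "2013-06-02 14:55:00", "2013-06-03 12:06:00", "2013-06-03 07:39:00",
    "2013-06-03 12:09:00", "2013-06-04 08:05:00", "2013-06-05 08:06:00",
    "2013-06-03 08:08:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row of one Service, count the rows of that
# Service whose closing time is not later than this row's creation time.
count_closed <- function(d) {
  create  <- as.numeric(d$Create.Date.Time)
  closing <- as.numeric(d$Closing.Date.Time)
  vapply(create, function(t) sum(closing <= t), integer(1))
}

# Apply the helper per Service and put the counts back in the original row order.
df$Num.Closed <- unsplit(lapply(split(df, df$Service), count_closed), df$Service)

print(df[, c("Service", "Create.Date.Time", "Num.Closed")])
```

With full date-times this gives 1 rather than 3 for the last SERVICE684841 row, because only that row's own closing time falls on or before 2013-06-03 08:08:00. The other rows agree with the table in the question.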

I didn't fully understand the problem as there is one instance where the expected output shown is different from the output I am getting. If that is just a typo:

data

 df <-   structure(list(Create.Date.Time = structure(c(1370105940, 1370174160, 
 1370180340, 1370196840, 1370208000, 1370258400, 1370260920, 1370261040, 
 1370261040, 1370261280), class = c("POSIXct", "POSIXt"), tzone = ""), 
 Service = c("AV", "SERVICE684793", "SERVICE684793", "SERVICE684796", 
"SERVICE684797", "SERVICE684793", "SERVICE684839", "SERVICE684841", 
"SERVICE684841", "SERVICE684841"), Closing.Date.Time = structure(c(1370109540, 
1370177940, 1370275260, 1370199300, 1370275560, 1370259540, 
1370275740, 1370347500, 1370433960, 1370261280), class = c("POSIXct", 
"POSIXt"), tzone = "")), .Names = c("Create.Date.Time", "Service", 
"Closing.Date.Time"), row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10"), class = "data.frame")

Extract the time of day (in seconds) from the POSIXct columns

library(lubridate)

dfNew <- within(df, {
    Createtime  <- period_to_seconds(hms(strftime(Create.Date.Time, "%H:%M:%S")))
    Closingtime <- period_to_seconds(hms(strftime(Closing.Date.Time, "%H:%M:%S")))
})

dfNew <- dfNew[order(dfNew$Service), ]  # optional: only groups the rows by Service for display

Using data.table

library(data.table)
# Within each Service: for row i, count the closings among rows 1..i whose time of day
# is not later than row i's creation time of day, then accumulate those counts.
setDT(dfNew)[, Num.Closed := cumsum(unlist(lapply(seq_len(.N), function(i) sum(Closingtime[1:i] <= Createtime[i])))),
             by = Service][, c(2, 1, 6), with = FALSE]
#           Service    Create.Date.Time Num.Closed
#  1:            AV 2013-06-01 12:59:00          0
#  2: SERVICE684793 2013-06-02 07:56:00          0
#  3: SERVICE684793 2013-06-02 09:39:00          1
#  4: SERVICE684793 2013-06-03 07:20:00          1
#  5: SERVICE684796 2013-06-02 14:14:00          0
#  6: SERVICE684797 2013-06-02 17:20:00          1
#  7: SERVICE684839 2013-06-03 08:02:00          0
#  8: SERVICE684841 2013-06-03 08:04:00          0
#  9: SERVICE684841 2013-06-03 08:04:00          0
# 10: SERVICE684841 2013-06-03 08:08:00          3
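
With around 70,000 rows, a per-row lapply inside each group can become slow for large Services. If a comparison on full date-times (rather than time of day) is acceptable, one possible alternative is to sort the closing times once and let base R's findInterval() do the counting. The sketch below is an assumption-laden illustration: the helper name count_closed_fast and the UTC timezone are not from the original post, and it counts against all closings in the group rather than only rows 1..i.

```r
# Hypothetical helper: for each creation time, count the closing times
# that are less than or equal to it.
count_closed_fast <- function(create, closing) {
  # findInterval() requires a non-decreasing vector, so sort the closings first.
  findInterval(as.numeric(create), sort(as.numeric(closing)))
}

# The three SERVICE684841 rows from the question, read as UTC (an assumption).
create <- as.POSIXct(c("2013-06-03 08:04:00", "2013-06-03 08:04:00",
                       "2013-06-03 08:08:00"), tz = "UTC")
closing <- as.POSIXct(c("2013-06-04 08:05:00", "2013-06-05 08:06:00",
                        "2013-06-03 08:08:00"), tz = "UTC")

num_closed <- count_closed_fast(create, closing)
print(num_closed)
```

For these three rows the counts are 0, 0 and 1. The helper would need to be applied separately to each Service, for example inside a grouped data.table call with by = Service.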
