简体   繁体   中英

aggregate by ID and find min() and max() of time

I got a transaction database which looks like this:

   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68 

PaymentDate is in posix format. With transaction data I don't want to aggregate by time intervals (this is well documented), but by ID.

Using min() with Posix times gives the first day, max() the last day. This is the information I need for every ID.

Ok, here is what I tried:

# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length

# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge

# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'

Apparently, aggregate does not work with posix time if you need aggregates of time rather than aggregates by time.

But getting first and last transaction date must be possible?!

Reply

Ok, since I cannot answer my own question yet, I post it here:

Interesting. Thank you!

I typically use the as.is=T option with read.csv, then convert time using strptime. So when I look at the structure of my data I get:

$ PaymentDate    : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...

To me, that does not look like a factor. I can use min() and max() on the whole column and it works. Apparently, POSIXlt is more troublesome than I thought. Coming from the POSIXlt, I did

data$PaymentDate=as.Date(data$PaymentDate)

Looking at the structure, the Class is correctly set as Date.

$ PaymentDate    :Class 'Date'  num [1:10000] 18296 18296 18296 18297 18297 ...

Now it appears to work. However, only ddply return the right format "2020-01-25", while both aggregate and summaryBy return in the "18286" format. Was that days since 1970-01-01? Well, I guess I can convert it back.

foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")

However, there has to be some explanation. Also, ddply is much slower.

Oh, why do I use strptime first? Well, the date in the original file is in a different format, "%d-%m-%y". Using as.Date directly on this did not appear to work.

EDIT

dput of my data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L), 
    mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L, 
    3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L, 
    7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L), 
    year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 
    110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L, 
    109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L, 
    109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L, 
    109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L, 
    4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L, 
    1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L, 
    2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L, 
    0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L, 
    273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L, 
    122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L), 
    isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID", 
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")

dput after doing as you suggested:

structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277, 
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550, 
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823, 
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386, 
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540, 
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L, 
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L, 
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L, 
52L, 53L, 54L, 55L, 40L), class = "data.frame")

dput of raw data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010", 
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010", 
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009", 
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009", 
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009", 
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010", 
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009", 
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010", 
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")

The problem is that your data, and in particular the PaymentDate column, is a factor. Both your ddply and aggregate solutions will work as written if you first transform the PaymentDate column:

#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09", 
            "2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1, 
            4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
    ), class = "data.frame", row.names = c("8", "9", "10", "11"))

Change variable class to Date.

df$PaymentDate <- as.Date(df$PaymentDate)

Then run your original code. Using ddply:

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

Using aggregate:

aggregate(PaymentDate ~ AccountID, df, min)
  AccountID PaymentDate
1        13  2020-01-25
2        14  2020-01-01

There is another, more general way to avoid this problem. By default, when you create a data.frame by using read.table (or its variants like read.csv ) the parameter stringsAsFactors is set to TRUE . When I recreate your data using stringsAsFactors=FALSE , you don't need the intermediate step of transforming PaymentDate, and your code works as written:

dat <- "   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68 "

df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM