简体   繁体   中英

R: How to use dplyr as alternative to aggregate

I have a dataframe times that looks like this:

user     time
A        7/7/2010
B        7/12/2010
C        7/12/2010
A        7/12/2010 
C        7/15/2010

I'm using aggregate(time ~ user, times, function(x) sort(as.vector(x))) to get this:

user     time
A        c(7/7/2010, 7/12/2010)
B        c(7/12/2010)
C        c(7/12/2010, 7/15/2010)

The problem is that I have over 20 million entries in times so aggregate is taking a over 4 hours. Is there any alternative using dplyr that will get me the sorted vector of dates?

Updated Answer: Based on your comment, how about this:

library(dplyr)

# Data (with a few additions)
times = read.table(text="user     time
A        7/7/2010
B        7/12/2010
B 7/13/2010
C        7/12/2010
A        7/12/2010 
A 7/13/2010
C        7/15/2010", header=TRUE, stringsAsFactors=FALSE)

times$time = as.Date(times$time, "%m/%d/%Y")

times
  user time 1 A 2010-07-07 2 B 2010-07-12 3 B 2010-07-13 4 C 2010-07-12 5 A 2010-07-12 6 A 2010-07-13 7 C 2010-07-15 
times %>% group_by(user) %>%
  summarise(First=min(time),
            Last=max(time),
            N = n(),
            minDiff=min(diff(time)),
            meanDiff=mean(diff(time)),
            NumDiffUniq = length(unique(diff(time))))
  user First Last N minDiff meanDiff NumDiffUniq 1 A 2010-07-07 2010-07-13 3 1 days 3 days 2 2 B 2010-07-12 2010-07-13 2 1 days 1 days 1 3 C 2010-07-12 2010-07-15 2 3 days 3 days 1 

Original Answer:

I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:

library(dplyr)

times.sorted = times %>% arrange(user, time)

If you want time to become a string of dates for each user , then you could do:

times.summary = times %>% group_by(user) %>%
  summarise(time = paste(time, collapse=","))

But note that for each user this will result in a single string containing the dates.

times.summary
  user time 1 A 7/7/2010,7/12/2010 2 B 7/12/2010 3 C 7/12/2010,7/15/2010 

If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:

times.new = times %>% group_by(user) %>%
  summarise(time = list(as.vector(time)))

times.new$time
 [[1]] [1] "7/7/2010" "7/12/2010" [[2]] [1] "7/12/2010" [[3]] [1] "7/12/2010" "7/15/2010" 

But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base, dplyr , or data.table functions to perform any analysis by group without first sorting your data.

Based on the dplyr solution by eipi10 and the suggestion of nrussell , I've written the following solution using data.table .

First you need to format the variable times :

times$time = as.Date(times$time, "%m/%d/%Y")

Then you'll need to convert times to a data.table using:

library(data.table)
times <- as.data.table(times)

Overwriting times was useful for my purposes but you may want to instantiate a new variable. After formatting your dataframe as a data.table just do:

new.times <- times[, 
                    .(first = min(time),
                      last = max(time),
                      n = .N,
                      meandiff = mean(diff(time)),
                      mindiff = min(diff(time)),
                      numdiffuniq = length(unique(diff(time))),
                      by='user')]

Running on a linux virtual machine with 128G RAM and using a sample of 1000 entires, the elapsed runtime was 0.43s.

See this tutorial for more on data.table.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM