efficient way of summarising multiple times with dplyr

Question

Quick example:

set.seed(123)
library("dplyr")
# tibble() replaces the deprecated data_frame(); TRUE is safer than T
df <- tibble(client = sample(letters, 200, replace = TRUE),
             content = sample(LETTERS, 200, replace = TRUE))

Each row records one client interacting with one piece of content. I want to know how many distinct pieces of content each client has used.

I do the following to obtain what I want:

df %>%
  group_by(client, content) %>%
  summarize(n=n()) %>%
  summarize(n_content=n())

# output
   client n_content
    (chr)     (int)
1       a         3
2       b         4
3       c         5
..    ...       ...

The whole point of the first summarize is to get only one row per client/content combination (since one client may use the same content several times). Therefore the output of the first n() is useless to me, which makes me think there must be a more efficient/elegant solution.

Is there a way to do this more efficiently? Ideally the solution would be compatible with dplyr, but base R or other packages are fine. I am not interested in solutions using data.table.

Answer 1

You could do:

df %>%
  distinct() %>%
  count(client)

Source: local data frame [26 x 2]

   client     n
    (chr) (int)
1       a     3
2       b     4
3       c     5
4       d    10
5       e     5
6       f     6
7       g     8
8       h     5
9       i     7
10      j    10
..    ...   ...
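
A caveat worth spelling out: distinct() with no arguments deduplicates on every column, which works here only because df has exactly the two columns of interest. A minimal sketch, assuming a hypothetical data frame df2 with an extra ts column (neither is from the original question), shows how naming the columns keeps the count correct:

```r
library(dplyr)

# Hypothetical data: client "a" uses content "X" twice, at different times
df2 <- tibble(
  client  = c("a", "a", "a", "b"),
  content = c("X", "X", "Y", "X"),
  ts      = 1:4
)

# Deduplicate on client/content only, so rows that differ
# just in ts are not counted as separate contents
res <- df2 %>%
  distinct(client, content) %>%
  count(client)

res
```

With a bare distinct() every row of df2 would survive, because ts makes each row unique, and client "a" would be counted three times instead of twice.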

Answer 2

Or with group_by and n_distinct:

df %>%
  group_by(client) %>%
  summarize(n_content=n_distinct(content))

This way is a bit faster:

library("microbenchmark")

# group_by + n_distinct
f1 <- function() df %>%
  group_by(client) %>%
  summarize(n_content = n_distinct(content))

# distinct + count
f2 <- function() df %>%
  distinct() %>%
  count(client)

microbenchmark(f1(), f2())

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 f1() 1.884358 1.996009 2.307482 2.123363 2.598729 3.318076   100  a 
 f2() 2.434831 2.532641 3.031416 2.817830 3.360372 5.462430   100   b
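
Since the question also allows base R, here is a rough sketch that checks the two dplyr approaches against a tapply() version on the example data. The variable names (by_distinct, by_n_distinct, base_counts) are illustrative only, and no claim is made here about how the base R version compares on speed.

```r
library(dplyr)

set.seed(123)
df <- tibble(client = sample(letters, 200, replace = TRUE),
             content = sample(LETTERS, 200, replace = TRUE))

# Answer 1: drop duplicate client/content pairs, then count rows per client
by_distinct <- df %>%
  distinct(client, content) %>%
  count(client)

# Answer 2: count distinct contents within each client group
by_n_distinct <- df %>%
  group_by(client) %>%
  summarize(n_content = n_distinct(content))

# Base R: number of unique contents per client
base_counts <- tapply(df$content, df$client, function(x) length(unique(x)))

# All three should agree client by client
stopifnot(identical(by_distinct$client, by_n_distinct$client))
stopifnot(all(by_distinct$n == by_n_distinct$n_content))
stopifnot(all(as.vector(base_counts) == by_n_distinct$n_content))
```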


solution1
2 2016-04-14 10:19:30

solution2
2 ACCEPTED 2016-04-14 10:21:27
