Quick example:
set.seed(123)
library("dplyr")
# tibble() replaces the deprecated data_frame()
df <- tibble(client  = sample(letters, 200, replace = TRUE),
             content = sample(LETTERS, 200, replace = TRUE))
I have observations of client interacting with content. I want to know how many different contents have been used by each client.
I do the following to obtain what I want:
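df %>%
  group_by(client, content) %>%
  summarize(n = n()) %>%          # one row per client/content pair; drops the content grouping
  summarize(n_content = n())      # rows per client = number of distinct contents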
# output
client n_content
(chr) (int)
1 a 3
2 b 4
3 c 5
.. ... ...
The whole point of the first summarize is to get only one row per client/content combination (since one client may use the same content several times). Therefore the output of the first n() is useless to me, which makes me think there must be a more efficient/elegant solution.
Is there a way to do this more efficiently? I am looking for a solution ideally compatible with dplyr, but base R or other packages are fine. I am not interested in solutions using data.table.
You could do:
df %>%
  distinct() %>%
  count(client)
Source: local data frame [26 x 2]
client n
(chr) (int)
1 a 3
2 b 4
3 c 5
4 d 10
5 e 5
6 f 6
7 g 8
8 h 5
9 i 7
10 j 10
.. ... ...
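Note that distinct() with no arguments deduplicates on all columns, so it only gives the right counts here because df has nothing but client and content. As a minimal sketch, assuming a data frame with an extra column (the timestamp column and the name df2 below are hypothetical, not part of the question), you can name the columns explicitly:

```r
library(dplyr)

# Hypothetical data: same idea as df, plus a timestamp column
# that makes every row unique
set.seed(123)
df2 <- tibble(
  client    = sample(letters, 200, replace = TRUE),
  content   = sample(LETTERS, 200, replace = TRUE),
  timestamp = seq_len(200)
)

# Deduplicate on client/content only, then count rows per client
res <- df2 %>%
  distinct(client, content) %>%
  count(client, name = "n_content")
```

A bare distinct() on df2 would keep all 200 rows, and count(client) would then return the number of observations per client rather than the number of distinct contents.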
Or with group_by and n_distinct:
df %>%
  group_by(client) %>%
  summarize(n_content = n_distinct(content))
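The group_by version is a bit faster:
library(microbenchmark)
# f1: group_by + n_distinct
f1 <- function() df %>%
  group_by(client) %>%
  summarize(n_content = n_distinct(content))
# f2: distinct + count
f2 <- function() df %>%
  distinct() %>%
  count(client)
microbenchmark(f1(), f2())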
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 1.884358 1.996009 2.307482 2.123363 2.598729 3.318076 100 a
f2() 2.434831 2.532641 3.031416 2.817830 3.360372 5.462430 100 b
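Since the question also allows base R, here is a rough sketch of the same two ideas without dplyr. It rebuilds the example data as a plain data.frame so it runs on its own; the names n_content and n_content2 are just illustrative.

```r
# Same example data as in the question, as a base data.frame
set.seed(123)
df <- data.frame(
  client  = sample(letters, 200, replace = TRUE),
  content = sample(LETTERS, 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Per client, count the unique contents (returns a named vector)
n_content <- tapply(df$content, df$client, function(x) length(unique(x)))

# Equivalent: drop duplicate client/content rows, then tabulate clients
n_content2 <- table(unique(df[, c("client", "content")])$client)
```

Both return one count per client, named by client, and should agree with the n_distinct result above.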