简体   繁体   中英

R - group_by n_distinct for summarise

My dataset looks like this

library(dyplr) 

dta = rbind(c(1,'F', 0), 
  c(1,'F', 0), 
  c(1,'F', 0), 
  c(2,'F', 1), 
  c(2,'F', 1), 
  c(3,'F', 1), 
  c(3,'F', 1), 
  c(3,'F', 1), 
  c(4,'M', 1), 
  c(4,'M', 1), 
  c(5,'M', 1), 
  c(6,'M', 0)
)

colnames(dta) <- c('id', 'sex', 'child')
dta = as.data.frame(dta)

So the data are long-format with id as personal identifier.

My problem is when I try to count the sex for example, I dont have the right count because of the repetition of the id .

So there is 3 females and 3 males.

but when I count I have

dta %>% 
  group_by(sex) %>% 
  summarise(n())

8 and 4 - because it counted the rows and not the unique id

Same problem with the cross table

dta %>% 
  group_by(sex, child) %>% 
  summarise(n())

How do I indicate the unique identifier ( n_distinct ) in the count ?

There are a ton of different ways to do this, here's one:

dta %>% distinct(id) %>%
        group_by(sex) %>%
        summarise(n())

EDIT: After some discussion, let's test how quick varying methods work.

First, some larger data:

dta <- data.frame(id = rep(1:500, 30),
                  sex = rep (c("M", "F"), 750),
                  child = rep(c(1, 0, 0, 1), 375))

Now let's run our varying methods:

library(microbenchmark)

microbenchmark(
    distinctcount = dta %>% distinct(id) %>% count(sex),
    uniquecount = dta %>% unique %>% count(sex),
    distinctsummarise = dta %>% distinct(id) %>% group_by(sex) %>% summarise(n()),
    uniquesummarise = dta %>% unique %>% group_by(sex) %>% summarise(n()),
    distincttally= dta %>% distinct(id) %>% group_by(sex) %>% tally
)

On my machine:

Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
     distinctcount  1.576307  1.602803  1.664385  1.630643  1.670195  2.233710   100
       uniquecount 32.391659 32.885479 33.194082 33.072485 33.244516 35.734735   100
 distinctsummarise  1.724914  1.760817  1.815123  1.792114  1.830513  2.178798   100
   uniquesummarise 32.757609 33.080933 33.490001 33.253155 33.463010 39.937194   100
     distincttally  1.618547  1.656947  1.715741  1.685554  1.731058  2.383084   100

We can see unique works pretty poorly on larger data, so the quickest is:

dta %>% distinct(id) %>% count(sex)

Base package:

aggregate(id ~ sex, dta, function(x) length(unique(x))) 

Output:

  sex id
1   F  3
2   M  3

Another alternative with dplyr :

library(dplyr) 
count_(unique(dta), vars = "sex") 

Output:

Source: local data frame [2 x 2]

  sex n
1   F 3
2   M 3

Using sqldf :

library(sqldf)
sqldf("SELECT sex, COUNT(DISTINCT(id)) AS n 
      FROM dta GROUP BY sex")

Output:

  sex n
1   F 3
2   M 3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM