
dplyr count unique values in two columns without reshaping long

What is the best way to count the unique values across two columns with dplyr, without reshaping the data to long format?

I know that passing multiple arguments to n_distinct() counts the distinct combinations of those arguments ( https://github.com/tidyverse/dplyr/issues/1084 ). This is not what I want.

My first guess was to use c() on the two columns, but the output is not what I expected. Could someone explain where the output comes from?

One possible solution is to use union(). Is there a better alternative?

library(dplyr)
d <- data.frame(Group = c("A", "B", "B", "C", "C", "C"),
                node1 = c("a", "b", "b", "c", "c", "c"),
                node2 = c("w", "r", "t", "z", "u", "i" )
                )



# count unique combinations
d %>%
  group_by(Group) %>%
  mutate( n = n_distinct( node1, node2))

# A tibble: 6 x 4
# Groups:   Group [3]
  Group node1 node2     n
  <fct> <fct> <fct> <int>
1 A     a     w         1
2 B     b     r         2
3 B     b     t         2
4 C     c     z         3
5 C     c     u         3
6 C     c     i         3



# what happens here?
d %>%
  group_by(Group) %>%
  mutate( n = n_distinct( c(node1, node2)))

# A tibble: 6 x 4
# Groups:   Group [3]
  Group node1 node2     n
  <fct> <fct> <fct> <int>
1 A     a     w         2
2 B     b     r         2
3 B     b     t         2
4 C     c     z         4
5 C     c     u         4
6 C     c     i         4
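
A likely explanation, given that the columns print as <fct> and the session runs R 3.6.3: before R 4.1.0, c() applied to factors drops the levels and returns the underlying integer codes, so n_distinct() ends up counting distinct codes rather than distinct labels. The sketch below rebuilds the two columns as factors and mimics that behaviour explicitly with as.integer(), so it gives the same result on any R version. The names codes_b and labels_b are only illustrative.

```r
# Rebuild the two columns as factors, as data.frame() did by default in R < 4.0
node1 <- factor(c("a", "b", "b", "c", "c", "c"))
node2 <- factor(c("w", "r", "t", "z", "u", "i"))

# Group B corresponds to rows 2 and 3.
# as.integer() exposes the integer codes that c() returned for factors in R < 4.1.0
codes_b <- c(as.integer(node1[2:3]), as.integer(node2[2:3]))
codes_b                  # 2 2 2 3
length(unique(codes_b))  # 2, the unexpected count for group B

# Combining the labels instead gives the intended count
labels_b <- c(as.character(node1[2:3]), as.character(node2[2:3]))
length(unique(labels_b)) # 3
```

In group B, "b" is the second level of node1, while "r" and "t" are the second and third levels of node2, so the codes collapse to two distinct values. Converting the columns to character first, or using union(), which combines the values as plain vectors, avoids the problem.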



# count unique in node1 and node2
d %>%
  group_by(Group) %>%
  mutate( n = n_distinct( union(node1, node2)))

# A tibble: 6 x 4
# Groups:   Group [3]
  Group node1 node2     n
  <fct> <fct> <fct> <int>
1 A     a     w         2
2 B     b     r         3
3 B     b     t         3
4 C     c     z         4
5 C     c     u         4
6 C     c     i         4

I am working on Ubuntu:

sessionInfo() 


R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.1

loaded via a namespace (and not attached):
 [1] fansi_0.4.0      assertthat_0.2.1 utf8_1.1.4       crayon_1.3.4     R6_2.4.0         lifecycle_0.2.0 
 [7] magrittr_1.5     pillar_1.4.2     cli_2.0.2        rlang_0.4.7      rstudioapi_0.10  vctrs_0.3.2     
[13] generics_0.0.2   tools_3.6.3      glue_1.4.1       purrr_0.3.3      compiler_3.6.3   pkgconfig_2.0.3 
[19] tidyselect_1.1.0 tibble_2.1.3 

I think your solution with c() and union() is better, but to provide an alternative, you can use cur_data(), available from dplyr 1.0.0:

library(dplyr)
d %>% group_by(Group) %>% mutate(n = n_distinct(unlist(cur_data())))


#  Group node1 node2     n
#  <chr> <chr> <chr> <int>
#1 A     a     w         2
#2 B     b     r         3
#3 B     b     t         3
#4 C     c     z         4
#5 C     c     u         4
#6 C     c     i         4

Note that cur_data() returns the complete data for each group, excluding the grouping variables. So if the data has other columns and you want n_distinct() to consider only the "node" columns, you have to select them explicitly:

d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(unlist(select(cur_data(), starts_with('node')))))
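
In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). Here is a minimal sketch of the same idea with pick(), assuming dplyr >= 1.1.0 is installed. The data frame is recreated with character columns so the block is self-contained, and the name res is illustrative.

```r
library(dplyr)

d <- data.frame(
  Group = c("A", "B", "B", "C", "C", "C"),
  node1 = c("a", "b", "b", "c", "c", "c"),
  node2 = c("w", "r", "t", "z", "u", "i"),
  stringsAsFactors = FALSE
)

res <- d %>%
  group_by(Group) %>%
  # pick() returns the selected columns for the current group as a tibble;
  # unlist() flattens them into one vector before counting distinct values
  mutate(n = n_distinct(unlist(pick(starts_with("node")), use.names = FALSE))) %>%
  ungroup()

res$n
```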

Another alternative is c_across(), available since dplyr 1.0.0:

library(dplyr)

d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c_across(everything())))

# # A tibble: 6 x 4
# # Groups:   Group [3]
#   Group node1 node2     n
#   <chr> <chr> <chr> <int>
# 1 A     a     w         2
# 2 B     b     r         3
# 3 B     b     t         3
# 4 C     c     z         4
# 5 C     c     u         4
# 6 C     c     i         4

Note: everything() in c_across() excludes grouping variables, i.e. Group, so n_distinct() effectively receives the combined values of node1 and node2. To select the columns explicitly, you can also use

  • c_across(node1:node2)
  • c_across(starts_with('node'))
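
Explicit selection matters once the data has columns that should not be counted. The sketch below adds a hypothetical numeric column, weight, which is not part of the original data, and restricts c_across() to the "node" columns. The names d2 and res2 are illustrative as well.

```r
library(dplyr)

d2 <- data.frame(
  Group  = c("A", "B", "B", "C", "C", "C"),
  node1  = c("a", "b", "b", "c", "c", "c"),
  node2  = c("w", "r", "t", "z", "u", "i"),
  weight = c(1, 2, 3, 4, 5, 6),  # hypothetical extra column, for illustration only
  stringsAsFactors = FALSE
)

res2 <- d2 %>%
  group_by(Group) %>%
  # only node1 and node2 are combined; weight is left out of the count
  mutate(n = n_distinct(c_across(starts_with("node")))) %>%
  ungroup()

res2$n
```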

We can also reshape to 'long' format, count the distinct values per group with n_distinct(), and join the counts back to the original data:

library(dplyr)
library(tidyr)
d %>%
  pivot_longer(cols = -Group) %>%
  group_by(Group) %>%
  summarise(n = n_distinct(value)) %>%
  # join the per-group counts back onto the original rows
  left_join(d, by = "Group")
