What is the best way to count unique values across two columns with dplyr, without reshaping?
I know that passing multiple arguments to n_distinct counts the distinct combinations of those arguments ( https://github.com/tidyverse/dplyr/issues/1084 ). This is not what I want.
My first guess was to use c() on the two columns, but the output is not what I expected. Could someone explain where the output comes from?
One possible solution is to use union. Is there a better alternative?
library(dplyr)
d <- data.frame(Group = c("A", "B", "B", "C", "C", "C"),
                node1 = c("a", "b", "b", "c", "c", "c"),
                node2 = c("w", "r", "t", "z", "u", "i"))
# count unique combinations
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(node1, node2))
# A tibble: 6 x 4
# Groups: Group [3]
Group node1 node2 n
<fct> <fct> <fct> <int>
1 A a w 1
2 B b r 2
3 B b t 2
4 C c z 3
5 C c u 3
6 C c i 3
# what happens here?
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c(node1, node2)))
# A tibble: 6 x 4
# Groups: Group [3]
Group node1 node2 n
<fct> <fct> <fct> <int>
1 A a w 2
2 B b r 2
3 B b t 2
4 C c z 4
5 C c u 4
6 C c i 4
# count unique in node1 and node2
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(union(node1, node2)))
# A tibble: 6 x 4
# Groups: Group [3]
Group node1 node2 n
<fct> <fct> <fct> <int>
1 A a w 2
2 B b r 3
3 B b t 3
4 C c z 4
5 C c u 4
6 C c i 4
I am working on Ubuntu:
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_CH.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.1
loaded via a namespace (and not attached):
[1] fansi_0.4.0 assertthat_0.2.1 utf8_1.1.4 crayon_1.3.4 R6_2.4.0 lifecycle_0.2.0
[7] magrittr_1.5 pillar_1.4.2 cli_2.0.2 rlang_0.4.7 rstudioapi_0.10 vctrs_0.3.2
[13] generics_0.0.2 tools_3.6.3 glue_1.4.1 purrr_0.3.3 compiler_3.6.3 pkgconfig_2.0.3
[19] tidyselect_1.1.0 tibble_2.1.3
I think your solution with c and union is better (c gives the intended count when the columns are character rather than factor), but as an alternative you can use cur_data(), available from dplyr 1.0.0:
library(dplyr)
d %>% group_by(Group) %>% mutate(n = n_distinct(unlist(cur_data())))
# Group node1 node2 n
# <chr> <chr> <chr> <int>
#1 A a w 2
#2 B b r 3
#3 B b t 3
#4 C c z 4
#5 C c u 4
#6 C c i 4
Note that cur_data() returns the complete data for each group, excluding the grouping variables. So if the data has other columns and you want only the "node" columns in n_distinct, you have to do:
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(unlist(select(cur_data(), starts_with('node')))))
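In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick(), which takes the column selection directly. A minimal sketch of the same count under that assumption (dplyr >= 1.1.0 installed; d is rebuilt here with character columns, and the name res is only illustrative, so the block runs on its own):

```r
library(dplyr)

# Same data as in the question, with character (not factor) columns
d <- data.frame(
  Group = c("A", "B", "B", "C", "C", "C"),
  node1 = c("a", "b", "b", "c", "c", "c"),
  node2 = c("w", "r", "t", "z", "u", "i"),
  stringsAsFactors = FALSE
)

res <- d %>%
  group_by(Group) %>%
  # pick() returns the selected columns of the current group as a data frame;
  # unlist() flattens them into one vector before counting distinct values
  mutate(n = n_distinct(unlist(pick(starts_with("node"))))) %>%
  ungroup()

print(res$n)
```

On older dplyr versions pick() does not exist, so the select(cur_data(), ...) form is the one to use there.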
An alternative is c_across(), available since dplyr 1.0.0:
library(dplyr)
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c_across(everything())))
# # A tibble: 6 x 4
# # Groups: Group [3]
# Group node1 node2 n
# <chr> <chr> <chr> <int>
# 1 A a w 2
# 2 B b r 3
# 3 B b t 3
# 4 C c z 4
# 5 C c u 4
# 6 C c i 4
Note: everything() in c_across() excludes grouping variables, i.e. Group, so n_distinct() effectively receives the combined values of node1 and node2 as input. To specify variables, you can also use:
c_across(node1:node2)
c_across(starts_with('node'))
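On the "what happens here?" part of the question: the tibble output there shows <fct> columns, and before R 4.1.0 base c() on factors dropped the levels and concatenated the underlying integer codes. The codes of node1 and node2 come from separate level sets, so unrelated labels can share a code (in group B, "b" and "r" are both code 2), which would explain the counts 2, 2, 4. union() converts its arguments with as.vector() first, so it compares labels instead. A base R sketch that reproduces both results on any R version by spelling out the codes explicitly (the levels are written out by hand, an assumption that matches the default alphabetical ordering; the helper names are illustrative):

```r
group <- c("A", "B", "B", "C", "C", "C")

# Factors as data.frame() created them by default before R 4.0.0
node1 <- factor(c("a", "b", "b", "c", "c", "c"),
                levels = c("a", "b", "c"))
node2 <- factor(c("w", "r", "t", "z", "u", "i"),
                levels = c("i", "r", "t", "u", "w", "z"))

# What c(node1, node2) amounted to before R 4.1.0: the integer codes
codes <- split(c(as.integer(node1), as.integer(node2)), c(group, group))
n_codes <- sapply(codes, function(x) length(unique(x)))

# What was intended: the labels themselves
labels <- split(c(as.character(node1), as.character(node2)), c(group, group))
n_labels <- sapply(labels, function(x) length(unique(x)))

print(n_codes)   # A = 2, B = 2, C = 4 (the surprising result)
print(n_labels)  # A = 2, B = 3, C = 4 (the union() result)
```

Converting with as.character() before c(), or building the data frame with stringsAsFactors = FALSE, avoids the collision on older R versions.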
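We can reshape to 'long' format, compute n_distinct by group, and join the counts back onto the original data:
library(dplyr)
library(tidyr)
d %>%
  pivot_longer(cols = -Group) %>%
  group_by(Group) %>%
  summarise(n = n_distinct(value)) %>%
  # name the key explicitly to avoid the "Joining, by" message
  left_join(d, by = "Group")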