
Row frequency of a data frame ignoring column order in R

I want to build a frequency table for the rows of a data frame.

I have found how to do this when the order of the columns matters, but I want the frequencies ignoring column order, so that a row A B and a row B A are counted together.

For example, given:

0   A       B     
1   B       A     
2   C       D      
3   D       C     
4   C       D

I wish to obtain:

A B 2
C D 3

Thanks in advance.

library("tidyverse")

x <- read.table(
  text = "0   A       B
          1   B       A
          2   C       D
          3   D       C
          4   C       D",
  stringsAsFactors = FALSE)

x %>%
  # Specify the columns to combine explicitly (here V2 and V3)
  # Then sort each pair and paste it into a single string
  mutate(pair = pmap_chr(list(V2, V3),
                         function(...) paste(sort(c(...)), collapse = " "))) %>%
  count(pair)
#> # A tibble: 2 x 2
#>   pair      n
#>   <chr> <int>
#> 1 A B       2
#> 2 C D       3

Created on 2019-03-29 by the reprex package (v0.2.1)

First sort each row, then group by all the columns and count the number of rows in each group.

library(dplyr)
df1 <- data.frame(t(apply(df[-1], 1, sort)))

df1 %>%
   group_by_all() %>%
   summarise(Freq = n())

 #   X1    X2     Freq
 #   <fct> <fct> <int>
 #1  A     B         2
 #2  C     D         3

data

df <- structure(list(
  V1 = 0:4,
  V2 = structure(c(1L, 2L, 3L, 4L, 3L),
                 .Label = c("A", "B", "C", "D"), class = "factor"),
  V3 = structure(c(2L, 1L, 4L, 3L, 4L),
                 .Label = c("A", "B", "C", "D"), class = "factor")),
  class = "data.frame", row.names = c(NA, -5L))
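
The sort-then-count idea above can also be sketched in base R, without dplyr. This is only an illustration under stated assumptions: the names `key` and `freq` are made up for this sketch, and the data is rebuilt here with character columns so the block runs on its own.

# Example data with character columns (assumption: same values as in the question)
df <- data.frame(V1 = 0:4,
                 V2 = c("A", "B", "C", "D", "C"),
                 V3 = c("B", "A", "D", "C", "D"),
                 stringsAsFactors = FALSE)

# Sort the values within each row and paste them into one order-independent key
key <- apply(df[-1], 1, function(r) paste(sort(r), collapse = " "))

# Count how often each key occurs; the result has columns `pair` and `Freq`
freq <- as.data.frame(table(pair = key))
freq

Because the key is built from the sorted row, this should also work for more than two columns, at the cost of the row-by-row apply() call.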

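We can use pmin/pmax to create the grouping variables, which should be more efficient than sorting each row. This assumes V2 and V3 are character vectors rather than factors, as in the data at the end of this answer.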

library(dplyr)
df %>%
   count(V2N = pmin(V2, V3), V3N = pmax(V2, V3))
# A tibble: 2 x 3
#  V2N   V3N       n
#   <chr> <chr> <int>
#1 A     B         2
#2 C     D         3
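
To see why this works, it may help to look at pmin/pmax on their own. The following is a minimal base R sketch, assuming the two columns are plain character vectors; the names `lo`, `hi` and `pair_counts` are invented for the illustration.

# The two columns from the example, as character vectors (not factors)
V2 <- c("A", "B", "C", "D", "C")
V3 <- c("B", "A", "D", "C", "D")

lo <- pmin(V2, V3)  # element-wise smaller value of each pair
hi <- pmax(V2, V3)  # element-wise larger value of each pair

# Every row is now in a canonical (smaller, larger) order, so A B and B A match
pair_counts <- table(paste(lo, hi))
pair_counts

Both functions are vectorised over the whole column, so no per-row function call is needed. That is the likely reason for the speed difference in the benchmark.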

Benchmarks

# Repeat the 5 example rows one million times (5e6 rows in total)
df1 <- df[rep(seq_len(nrow(df)), 1e6),]
system.time({

df1 %>%
       count(V2N = pmin(V2, V3), V3N = pmax(V2, V3))

 })
#user  system elapsed 
#  1.164   0.043   1.203 


system.time({
df2 <- data.frame(t(apply(df1[-1], 1, sort)))

df2 %>%
   group_by_all() %>%
   summarise(Freq = n())

   })

#   user  system elapsed 
# 160.357   1.227 161.544 

data

df <- structure(list(V1 = 0:4, V2 = c("A", "B", "C", "D", "C"), V3 = c("B", 
  "A", "D", "C", "D")), row.names = c(NA, -5L), class = "data.frame")
