Retrieve the most repeated (x, y) values in two columns in a data frame

I am storing (x, y) values in a dataframe. I want to return the most frequently appearing (x, y) combination.

Here is an example:

> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12

The most common (x, y) value would be (1, 1).

I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.

> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL

How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?

EDIT: c(1, 2) should be considered distinct from c(2, 1).

I'm not sure what the desired output should look like, but here's a possible solution

res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1 
#   2 

In order to get the actual values, one could do

res <- do.call(paste, xy) 
xy[which.max(ave(seq(res), res, FUN = length)), ]
#   x y
# 1 1 1

(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches

res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))

might be simple and effective.)
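To see why this returns a row index, here is a small step-by-step sketch on the question's sample data. The intermediate names (key, first_seen, counts) are only illustrative, and it assumes "\r" never occurs inside the values themselves.

```r
# Sample data from the question
xy <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6),
                 y = c(1, 1, 5, 6, 9, 10, 12))

# One character key per row; "\r" is assumed not to occur in the data
key <- do.call("paste", c(xy, sep = "\r"))

# Each row is mapped to the index of the first row with the same key
first_seen <- match(key, key)

# Count how many rows point at each first-occurrence index
counts <- tabulate(first_seen)

# which.max() picks the first index with the highest count,
# which is also a valid row number of xy
xy[which.max(counts), ]
```

Because ties are resolved by which.max(), only the first of several equally frequent pairs is reported.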

Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column

x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))

Then encode the combination of columns to unique values

v = x + max(x) * (y - 1L)

Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table() in other answers) to the minimum, without creating character vectors.
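As a rough illustration of the index-then-encode idea, the sketch below uses a column-major linear index, xi + max(xi) * (yi - 1), which is one collision-free way to combine two indexed columns. The names xi, yi, code and swapped are made up for this example.

```r
# Sample data from the question
xy <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6),
                 y = c(1, 1, 5, 6, 9, 10, 12))

# Index each column: values become integers 1..(number of unique values)
xi <- match(xy$x, unique(xy$x))
yi <- match(xy$y, unique(xy$y))

# Column-major linear index: every (xi, yi) pair gets its own integer
code <- xi + max(xi) * (yi - 1L)

# There are as many distinct codes as distinct (x, y) rows
length(unique(code)) == nrow(unique(xy))

# Order matters: (1, 2) and (2, 1) receive different codes
swapped <- data.frame(x = c(1, 2), y = c(2, 1))
si <- match(swapped$x, unique(swapped$x))
sj <- match(swapped$y, unique(swapped$y))
swapped_code <- si + max(si) * (sj - 1L)
```

The codes stay plain integers, so they can be passed to tabulate() directly. For very many unique values per column the product could exceed the integer range, which this sketch does not guard against.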

If one wanted the most common occurrence in a single dimension, then one could index and tabulate v

tbl = tabulate(match(v, v))

and find the index of the first occurrence of the maximum value(s), e.g.,

xy[which.max(tbl),]

Here's a function to do the magic

whichpairmax <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + max(x) * (y - 1L)
    which.max(tabulate(match(v, v)))
}

and a couple of tests

> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1

For an arbitrary data.frame

whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        col = match(col, unique(col))
        v = col + max(col) * (match(v, unique(v)) - 1L)
    }
    which.max(tabulate(match(v, v)))
}

Try

library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
#   x y N
#1: 1 1 2

t<-table(xy)
which(t == max(t), arr.ind = TRUE)

Update:

As pointed out by David Arenburg, the initial code returned just the indices into the result of table(xy). If you need the values, and maybe the number of occurrences of the most frequent pair, you can try this:

t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

I suspect there is a better way to write the last three lines of code, but I'm still new to the R world.
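One possible shortcut, offered as a sketch rather than the definitive way: as.data.frame() on a table gives one row per (x, y) cell with a Freq column, so the lookup through dimnames() can be skipped. The names counts and top are illustrative, and the conversion back with as.numeric() assumes the original columns were numeric.

```r
# Sample data from the question
xy <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6),
                 y = c(1, 1, 5, 6, 9, 10, 12))

# Long format: one row per (x, y) cell of the contingency table
counts <- as.data.frame(table(xy), stringsAsFactors = FALSE)

# Keep every pair that reaches the maximum count (so ties are kept too)
top <- counts[counts$Freq == max(counts$Freq), ]

# The table stores the values as labels, so convert them back to numbers
top$x <- as.numeric(top$x)
top$y <- as.numeric(top$y)
top
```

Note that table() still builds the full x-by-y grid, so this is no cheaper than the code above when the columns have many unique values.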

library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
    x y Count
 1: 1 1     2

What about this?

x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)

table(xy)
   y
x   1 5 6 9 10 12
  1 2 0 0 0  0  0
  2 0 1 0 0  0  0
  3 0 0 1 0  0  0
  4 0 0 0 1  0  0
  5 0 0 0 0  1  0
  6 0 0 0 0  0  1

library(dplyr)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

With dplyr

library(dplyr)

xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)

Late to the party, but here's a time test:

library(microbenchmark)

x <- sample(1:10, 1e5, replace = TRUE)
y <- sample(1:10, 1e5, replace = TRUE)

# Integer encoding of the (x, y) pairs, then tabulate
martin <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + max(x) * (y - 1L)
    which.max(tabulate(match(v, v)))
}

# data.table grouping
akrun <- function(x, y) {
    library(data.table)
    xy <- data.frame(x, y)
    setDT(xy)[, .N, list(x, y)][which.max(N)]
}

# Two-way contingency table
mucio <- function(x, y) {
    xy <- data.frame(x, y)
    t <- table(xy)
    indexes <- which(t == max(t), arr.ind = TRUE)[1, ]
    x_value <- dimnames(t)$x[indexes["x"]]
    y_value <- dimnames(t)$y[indexes["y"]]
    rep_number <- max(t)
}

# dplyr with tally() and top_n()
sam <- function(x, y) {
    library(dplyr)
    xy <- data.frame(x, y)
    xy %>%
        group_by(x, y) %>%
        tally() %>%
        ungroup %>%
        top_n(1)
}

# dplyr with summarise() and filter()
dimitris <- function(x, y) {
    library(dplyr)
    xy <- data.frame(x, y)
    xy %>% group_by(x, y) %>% summarise(n = n()) %>%
        ungroup %>% filter(n == max(n)) %>% select(-n)
}

microbenchmark(martin(x, y), akrun(x, y), mucio(x, y), sam(x, y), dimitris(x, y), times = 5)

Unit: milliseconds
           expr       min        lq       mean    median         uq        max neval cld
   martin(x, y) 11.727217 14.246913  41.359218 14.384385  82.639796  83.797780     5  a 
    akrun(x, y)  4.426462  4.613420   4.866548  4.892432   5.011406   5.389018     5  a 
    mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870 146.982062     5   b
      sam(x, y)  8.356426  8.586212   8.919787  8.586521   8.775792  10.293983     5  a 
 dimitris(x, y)  8.618394  8.738228   9.252105  9.063965   9.075298  10.764640     5  a 

Using sqldf :

library(sqldf)    
sqldf('SELECT x, y 
          FROM xy 
          GROUP BY x, y 
          ORDER BY COUNT(*) DESC 
          LIMIT 1')
  x y
1 1 1 

If we'd like to show a frequency column, and not just one row (in case there are any ties):

x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)

sqldf('SELECT x, y, COUNT(*) AS freq
      FROM xy 
      GROUP BY x, y 
      ORDER BY COUNT(*) DESC')

   x  y freq
1  1  1    2
2 12 12    2
3  2  5    1
4  3  6    1
5  4  9    1
