I am storing (x, y)
values in a dataframe. I want to return the most frequently appearing (x, y)
combination.
Here is an example:
> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
x y
1 1 1
2 1 1
3 2 5
4 3 6
5 4 9
6 5 10
7 6 12
The most common (x, y)
value would be (1, 1)
.
I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.
> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL
How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?
EDIT: c(1, 2)
should be considered distinct from c(2, 1)
.
Not sure how will the desired output should look like, but here's a possible solution
res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1
# 2
In order to get the actual values, one could do
res <- do.call(paste, xy)
xy[which.max(ave(seq(res), res, FUN = length)), ]
# x y
# 1 1 1
(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches
res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))
might be simple and effective.)
Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy
to integers ranging from 1 to the number of unique values in the column
x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))
Then encode the combination of columns to unique values
v = x + (max(x) - 1L) * y
Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table()
in other answers) to the minimum, without creating character vectors.
If one wanted to most common occurrence in a single dimension, then one could index and tabulate v
tbl = tabulate(match(v, v))
and find the index of the first occurrence of the maximum value(s), eg,
df[which.max(tbl),]
Here's a function to do the magic
whichpairmax <- function(x, y) {
x = match(x, unique(x)); y = match(y, unique(y))
v = x + (max(x) - 1L) * y
which.max(tabulate(match(v, v)))
}
and a couple of tests
> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
x y
1 1 1
> xy1
x y
3 2 5
5 4 9
7 6 12
4 3 6
6 5 10
1 1 1
2 1 1
For an arbitrary data.frame
whichdfmax <- function(df) {
v = integer(nrow(df))
for (col in df) {
col = match(col, unique(col))
v = col + (max(col) - 1L) * match(v, unique(v))
}
which.max(tabulate(match(v, v)))
}
Try
library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
# x y N
#1: 1 1 2
t<-table(xy)
which(t == max(t), arr.ind = TRUE)
Update:
As pointed out by David Arenburg, the initial code returned just the index of the values from the table(xy)
function. If you need the values and maybe the number of occurrences of the max couple you can try this:
t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)
Now I suspect there is better way to write the last three lines of code, but I'm still new to the R world
library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
x y Count
1: 1 1 2
What about this?
x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)
table(xy)
y
x 1 5 6 9 10 12
1 2 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 0 0
5 0 0 0 0 1 0
6 0 0 0 0 0 1
library(dplyr)
xy %>%
group_by(x, y) %>%
tally() %>%
ungroup %>%
top_n(1)
With dplyr
library(dplyr)
xy %>% group_by(x, y) %>% summarise(n=n()) %>%
ungroup %>% filter(n==max(n)) %>% select(-n)
Late to the party, but here's a time test:
x<-sample(1:10,1e5,rep=TRUE)
y<-sample(1:10,1e5,rep=TRUE)
martin <- function(x, y) {
x = match(x, unique(x)); y = match(y, unique(y))
v = x + (max(x) - 1L) * y
which.max(tabulate(match(v, v)))
}
akrun <-function(x,y) {
library(data.table)
xy<-data.frame(x,y)
setDT(xy)[, .N,list(x,y)][which.max(N)]
}
mucio <-function(x,y){
xy<-data.frame(x,y)
t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)
}
sam<-function(x,y){
library(dplyr)
xy<-data.frame(x,y)
xy %>%
group_by(x, y) %>%
tally() %>%
ungroup %>%
top_n(1)
}
dimitris<-function(x,y){
library(dplyr)
xy<-data.frame(x,y)
xy %>% group_by(x, y) %>% summarise(n=n()) %>%
ungroup %>% filter(n==max(n)) %>% select(-n)
}
microbenchmark(martin(x,y),akrun(x,y),mucio(x,y),sam(x,y),dimitris(x,y),times=5)
Unit: milliseconds
expr min lq mean median uq
martin(x, y) 11.727217 14.246913 41.359218 14.384385 82.639796
akrun(x, y) 4.426462 4.613420 4.866548 4.892432 5.011406
mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870
sam(x, y) 8.356426 8.586212 8.919787 8.586521 8.775792
dimitris(x, y) 8.618394 8.738228 9.252105 9.063965 9.075298
max neval cld
83.797780 5 a
5.389018 5 a
146.982062 5 b
10.293983 5 a
10.764640 5 a
Using sqldf
:
library(sqldf)
sqldf('SELECT x, y
FROM xy
GROUP BY (x||y)
ORDER BY COUNT(*) DESC
LIMIT 1')
x y
1 1 1
If we'd like to show a frequency column, and not just one row (in case there are any ties):
x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)
sqldf('SELECT x, y, COUNT(*) AS freq
FROM xy
GROUP BY (x||y)
ORDER BY COUNT(*) DESC')
x y freq
1 1 1 2
2 12 12 2
3 2 5 1
4 3 6 1
5 4 9 1
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.