
Retrieve the most repeated (x, y) values in two columns in a data frame

I am storing (x, y) values in a data frame. I want to return the most frequently appearing (x, y) combination.

Here is an example:

> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12

The most common (x, y) value would be (1, 1).

I tried the answer here, which was written for a single column. It works for a single column, but not for a combination of two columns.

> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL

How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?

EDIT: c(1, 2) should be considered distinct from c(2, 1).

Not sure what the desired output should look like, but here's a possible solution:

res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1 
#   2 

To get the actual values, one could do

res <- do.call(paste, xy) 
xy[which.max(ave(seq(res), res, FUN = length)), ]
#   x y
# 1 1 1

(Despite all the upvotes, a hybrid of @DavidArenburg's approach and mine

res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))

might be simple and effective.)
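The choice of separator matters when the columns hold strings that may themselves contain it. A minimal sketch with made-up data (the data frame `d` is hypothetical, not from the question) showing how the default space separator can merge two distinct pairs, while a character unlikely to occur in the data, such as "\r", keeps them apart:

```r
# Hypothetical data: two distinct rows whose values contain spaces
d <- data.frame(x = c("a b", "a"), y = c("c", "b c"), stringsAsFactors = FALSE)

# Default separator (" "): both rows collapse to the same key "a b c"
key_space <- do.call(paste, d)
length(unique(key_space))   # 1, the two pairs are no longer distinguishable

# A separator that does not occur in the data keeps the keys distinct
key_cr <- do.call("paste", c(d, sep = "\r"))
length(unique(key_cr))      # 2
```

For purely numeric columns, as in the question, the default separator is unambiguous.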

Maybe it seems a little roundabout, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column

x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))

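Then encode the combination of columns to unique values. The multiplier has to be at least max(x) for distinct pairs to get distinct codes; with max(x) - 1L, the index pairs (max(x), 1) and (1, 2) would collide.

v = x + max(x) * (y - 1L)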

Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required by any tabulation (as with table() in other answers) to the minimum, without creating character vectors.
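To see why indexing followed by encoding works, it may help to check on a small case that every (x, y) index pair maps to its own integer. The sketch below assumes a mixed-radix encoding of the form x + max(x) * (y - 1), which numbers the cells of an nx-by-ny grid column by column; the data and variable names are illustrative.

```r
# Illustrative data: arbitrary values, not necessarily small integers
xv <- c(10, 10, 25, 40, 25)
yv <- c("a", "a", "b", "a", "a")

# Step 1: index each column to integers 1..(number of unique values)
xi <- match(xv, unique(xv))   # 1 1 2 3 2
yi <- match(yv, unique(yv))   # 1 1 2 1 1

# Step 2: encode each pair as a single integer (column-major cell number)
v <- xi + max(xi) * (yi - 1L)
v                             # 1 1 5 3 2

# Sanity check: all nx * ny possible index pairs get distinct codes
grid <- expand.grid(x = seq_len(max(xi)), y = seq_len(max(yi)))
codes <- grid$x + max(xi) * (grid$y - 1L)
anyDuplicated(codes) == 0L    # TRUE
```

Because the codes never exceed nx * ny, a tabulation over them needs at most that many bins.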

If one wanted the most common occurrence in a single dimension, then one could index and tabulate v

tbl = tabulate(match(v, v))

and find the index of the first occurrence of the maximum value(s), e.g.,

xy[which.max(tbl), ]
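The two calls do different jobs: match(v, v) replaces each code by the position of its first occurrence, and tabulate() counts how often each position appears. A short worked example with a hand-made vector of codes (the values are illustrative):

```r
v <- c(7L, 3L, 7L, 9L, 3L, 7L)   # illustrative codes

first_pos <- match(v, v)         # position of each code's first occurrence
first_pos                        # 1 2 1 4 2 1

tbl <- tabulate(first_pos)       # counts indexed by first-occurrence position
tbl                              # 3 2 0 1

which.max(tbl)                   # 1: the row where the most common code first appears
```

Since the counts are indexed by row position, the result of which.max() can be used directly to subset the original data.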

Here's a function to do the magic

whichpairmax <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + max(x) * (y - 1L)   # multiplier max(x) keeps distinct pairs distinct
    which.max(tabulate(match(v, v)))
}

and a couple of tests

> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1

For an arbitrary data.frame

whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        col = match(col, unique(col))
        v = col + max(col) * (match(v, unique(v)) - 1L)   # re-index v, then encode
    }
    which.max(tabulate(match(v, v)))
}
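A usage sketch for the multi-column case. To stay self-contained it defines its own helper, `most_common_row()` (a hypothetical name), which folds the columns into one integer code in the same spirit and returns the matching row together with its count; the three-column data frame is made up for illustration.

```r
# Hypothetical helper: most frequent combination across all columns of df
most_common_row <- function(df) {
    code <- rep(1L, nrow(df))                  # every row starts in the same group
    for (col in df) {
        idx <- match(col, unique(col))         # index this column to 1..k
        # combine with the groups so far, then re-index to keep codes small
        code <- idx + max(idx) * (code - 1L)
        code <- match(code, unique(code))
    }
    counts <- tabulate(code)
    best <- which.max(counts)                  # first group with maximal count
    list(row = df[match(best, code), , drop = FALSE], count = counts[best])
}

d3 <- data.frame(a = c(1, 2, 1, 2, 1),
                 b = c("u", "v", "u", "v", "w"),
                 c = c(TRUE, FALSE, TRUE, TRUE, TRUE))
res <- most_common_row(d3)
res$row     # row 1: a = 1, b = "u", c = TRUE
res$count   # 2
```

Re-indexing after every column keeps the codes no larger than the number of rows, which limits the risk of integer overflow when there are many columns.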

Try

library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
#   x y N
#1: 1 1 2

t <- table(xy)
which(t == max(t), arr.ind = TRUE)

Update:

As pointed out by David Arenburg, the initial code returned just the index of the values from the table(xy) function. If you need the values, and maybe the number of occurrences of the most frequent pair, you can try this:

t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

I suspect there is a better way to write the last three lines of code, but I'm still new to the R world.
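One possible shortening, offered as a sketch rather than the answerer's own code: as.data.frame() turns the contingency table into one row per (x, y) combination with a Freq column, so the values and the count can be read off together. The values come back as character strings here and may need converting. The sample data is redefined to keep the block self-contained.

```r
x <- c(1, 1, 2, 3, 4, 5, 6)
y <- c(1, 1, 5, 6, 9, 10, 12)
xy <- data.frame(x, y)

# One row per (x, y) cell of the table, with its count in Freq
counts <- as.data.frame(table(xy), stringsAsFactors = FALSE)

# Keep every combination that reaches the maximum count (handles ties)
top <- counts[counts$Freq == max(counts$Freq), ]
top
```

Note that table() allocates a cell for every combination of observed x and y values, so this can be memory-hungry when both columns have many distinct values.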

library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
    x y Count
 1: 1 1     2

What about this?

x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)

table(xy)
   y
x   1 5 6 9 10 12
  1 2 0 0 0  0  0
  2 0 1 0 0  0  0
  3 0 0 1 0  0  0
  4 0 0 0 1  0  0
  5 0 0 0 0  1  0
  6 0 0 0 0  0  1

library(dplyr)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

With dplyr:

library(dplyr)

xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)
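For comparison, roughly the same group-and-count logic can be written without dplyr. This is a base R sketch, not part of the answer above; the helper column `n` and the name `top_pairs` are introduced only for illustration.

```r
x <- c(1, 1, 2, 3, 4, 5, 6)
y <- c(1, 1, 5, 6, 9, 10, 12)
xy <- data.frame(x, y)

# Count rows per (x, y) group by summing a column of ones within each group
counts <- aggregate(n ~ x + y, data = transform(xy, n = 1L), FUN = sum)

# Keep all groups with the maximum count (ties included), drop the count column
top_pairs <- counts[counts$n == max(counts$n), c("x", "y")]
top_pairs
```

Like the filter(n == max(n)) step, this returns every tied pair rather than only the first one.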

Late to the party, but here's a time test:

library(microbenchmark)

x <- sample(1:10, 1e5, replace = TRUE)
y <- sample(1:10, 1e5, replace = TRUE)

# integer encoding of the pair, then tabulate
martin <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + max(x) * (y - 1L)
    which.max(tabulate(match(v, v)))
}
# data.table grouping
akrun <- function(x, y) {
    library(data.table)
    xy <- data.frame(x, y)
    setDT(xy)[, .N, list(x, y)][which.max(N)]
}
# two-way contingency table
mucio <- function(x, y) {
    xy <- data.frame(x, y)
    t <- table(xy)
    indexes <- which(t == max(t), arr.ind = TRUE)[1, ]
    x_value <- dimnames(t)$x[indexes["x"]]
    y_value <- dimnames(t)$y[indexes["y"]]
    rep_number <- max(t)
}
# dplyr with tally() and top_n()
sam <- function(x, y) {
    library(dplyr)
    xy <- data.frame(x, y)
    xy %>%
        group_by(x, y) %>%
        tally() %>%
        ungroup %>%
        top_n(1)
}
# dplyr with summarise() and filter()
dimitris <- function(x, y) {
    library(dplyr)
    xy <- data.frame(x, y)
    xy %>% group_by(x, y) %>% summarise(n = n()) %>%
        ungroup %>% filter(n == max(n)) %>% select(-n)
}

microbenchmark(martin(x, y), akrun(x, y), mucio(x, y), sam(x, y), dimitris(x, y), times = 5)

Unit: milliseconds
           expr       min        lq       mean    median         uq        max neval cld
   martin(x, y) 11.727217 14.246913  41.359218 14.384385  82.639796  83.797780     5  a 
    akrun(x, y)  4.426462  4.613420   4.866548  4.892432   5.011406   5.389018     5  a 
    mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870 146.982062     5   b
      sam(x, y)  8.356426  8.586212   8.919787  8.586521   8.775792  10.293983     5  a 
 dimitris(x, y)  8.618394  8.738228   9.252105  9.063965   9.075298  10.764640     5  a 

Using sqldf:

library(sqldf)    
sqldf('SELECT x, y 
          FROM xy 
          GROUP BY x, y 
          ORDER BY COUNT(*) DESC 
          LIMIT 1')
  x y
1 1 1 

If we'd like to show a frequency column, and not just one row (in case there are any ties):

x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)

sqldf('SELECT x, y, COUNT(*) AS freq
      FROM xy 
      GROUP BY x, y 
      ORDER BY COUNT(*) DESC')

   x  y freq
1  1  1    2
2 12 12    2
3  2  5    1
4  3  6    1
5  4  9    1
