Calculating the mode or 2nd/3rd/4th most common value
Surely there has to be a function out there in some package for this?
I've searched and found this function to calculate the mode:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by() groups.
Thank you for your help!
Maybe you could try:
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives the unique values of the vector sorted by decreasing frequency. The first will be the mode, the 2nd will be the 2nd most common, etc.
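To get just the n-th most common value, indexing the result of f should be enough. The sketch below wraps that in a small helper; the name nth_common and the toy vector are made up for illustration. Note that sort() drops NA values by default, so missing values are not counted.

```r
# f as defined in this answer
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])

# Hypothetical wrapper: n-th most common value of x
nth_common <- function(x, n = 1) f(x)[n]

x <- c(3, 3, 3, 1, 1, 2)   # toy data: 3 occurs 3 times, 1 twice, 2 once
nth_common(x, 1)           # 3
nth_common(x, 2)           # 1
nth_common(x, 5)           # NA, since there are only three distinct values
```

Asking for a rank beyond the number of distinct values gives NA rather than an error, which may or may not be what you want inside a grouped summary.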
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x will be coerced to a factor first. If you have a large vector, this is very slow. Also, on exit, we have to extract the character names of the table and coerce them to numeric.
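A related consequence of that final as.numeric() step, as far as I can tell, is that g only makes sense for numeric input, whereas f keeps the type of its input. A small sketch with a made-up character vector:

```r
# f and g as defined in this answer
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))

s <- c("b", "a", "b", "c", "b", "a")   # toy data: "b" x3, "a" x2, "c" x1

f(s)                     # "b" "a" "c"
suppressWarnings(g(s))   # NA NA NA, because the names cannot be coerced to numeric
```

If you do want the table() route for non-numeric data, dropping the as.numeric() call and keeping the names as character would be the obvious adjustment.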
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table():
tab <- sort(table(x), decreasing = TRUE)
# 11 12  7  9  8 13 10 14  5 15  6  2  3 16
# 14 14 11 11 10 10  9  7  5  4  2  1  1  1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
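Since the goal is to apply this per group, here is a minimal sketch of how that might look. The data frame df and its columns g and v are made up for illustration. The runnable part uses base R's tapply(); the commented line shows what I would expect the dplyr equivalent to be.

```r
# f as defined in this answer
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])

# Toy data (hypothetical): a grouping column g and a value column v
df <- data.frame(g = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 v = c(1, 1, 1, 2, 2, 5, 5, 7))

# 2nd most common value of v within each group
second <- tapply(df$v, df$g, function(v) f(v)[2])
second
# group "a" gives 2, group "b" gives 7

# With dplyr, the same idea would be something like:
# dplyr::summarise(dplyr::group_by(df, g), second = f(v)[2])
```

A group with fewer than two distinct values would get NA here.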
Here is an R function that I made (inspired by several other SO posts), which may work for your goal (I use a local dataset on religious affiliation to illustrate it).
It's simple; only base R functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
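Find_Nth_Mode = function(d, N = 2) {
  # N-th largest element of x, found with a partial sort
  maxN = function(x, N) {
    len = length(x)
    if (N > len) {
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len - N + 1)[len - N + 1]
  }
  ux = unique(as.character(d))   # distinct values, as character
  a1 = tabulate(match(d, ux))    # count of each distinct value
  a2 = maxN(a1, N)               # N-th largest count
  a3 = which(a1 == a2)           # positions having that count
  ux[a3]                         # value(s) with the N-th largest count
}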
Sample output
> table(religion_data$relig11)
                   0.None 1.Protestant_Conservative      2.Protestant_Liberal                3.Catholic
                    34486                      6134                     19678                     36880
               4.Orthodox             5.Islam_Sunni              6.Islam_Shia                   7.Hindu
                    20702                     28170                       668                      4653
               8.Buddhism                  9.Jewish                  10.Other
                     9983                       381                      6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
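One thing to be aware of: when counts are tied, Find_Nth_Mode returns every value whose count equals the N-th largest count, so tied values can show up for more than one N. If you would rather rank the distinct counts, so that tied values share a single rank, a variation along these lines might work. This is only a sketch; the name nth_mode_all and the toy vector are my own, not from the posts referenced here.

```r
# Hypothetical helper: all values sharing the n-th highest *distinct* frequency
nth_mode_all <- function(x, n = 1) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))               # count of each distinct value
  lv <- sort(unique(counts), decreasing = TRUE)  # distinct counts, highest first
  if (n > length(lv)) return(ux[0])              # no such rank: empty result
  ux[counts == lv[n]]
}

d <- c("a", "a", "b", "b", "c")   # toy data: "a" and "b" are tied
nth_mode_all(d, 1)   # "a" "b"
nth_mode_all(d, 2)   # "c"
nth_mode_all(d, 3)   # character(0)
```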
Reference: I want to express my gratitude to these posts, from which I took the two functions and integrated them into one:
Function to find the N-th largest value: Fastest way to find second (third...) highest/lowest value in vector or column
How to find the second-largest mode value: Calculating the mode or 2nd/3rd/4th most common value