How to retrieve the most repeated value in a column present in a data frame
I am trying to retrieve the most repeated value in a particular column of a data frame. Here is my sample data and code.
data("Forbes2000", package = "HSAUR")
head(Forbes2000)
  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55
As per my sample data, I need to return the most repeated category, which is Insurance.
subset(Forbes2000, country == "Bermuda")
tail(names(sort(table(Forbes2000$category))), 1)
If two or more categories may be tied for most frequent, use something like this:
x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance"
Another way is to use the data.table package, which is faster for large data sets:
set.seed(1)
x=sample(seq(1,100), 5000000, replace = TRUE)
Method 1 (the solution proposed above):
start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 4.883488 secs
Method 2 (data.table):
library(data.table)
start.time <- Sys.time()
ds <- data.table(x)
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]
most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.328033 secs
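For what it's worth, the data.table approach above can be written more compactly and made tie-aware. This is only a sketch, assuming the data.table package is installed; the small vector x and the name most_repeated are illustrative, not part of the benchmark above.

```r
library(data.table)

# Illustrative input with a tie between "Insurance" and "Food markets"
x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
ds <- data.table(x = x)

# Count rows per distinct value (.N), then keep every value whose count
# equals the maximum, so ties are all returned
most_repeated <- ds[, .N, by = x][N == max(N), x]
most_repeated
```

Chaining the two bracket calls avoids sorting the counts, and filtering on N == max(N) returns every tied value rather than just the first.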
I know my answer is coming a little late, but I built the following function, which does the job in less than a second on my data frame of more than 50,000 rows:
print_count_of_unique_values <- function(df, column_name, remove_items_with_freq_equal_or_lower_than = 0, return_df = FALSE,
                                         sort_desc = TRUE, return_most_frequent_value = FALSE)
{
  temp <- df[column_name]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item","Frequency")
  output_df <- output[ output[[2]] > remove_items_with_freq_equal_or_lower_than, ]
  if (sort_desc){
    output_df <- output_df[order(output_df[[2]], decreasing = TRUE), ]
  }
  cat("\nThis is the (head) count of the unique values in dataframe column '", column_name,"':\n")
  print(head(output_df))
  if (return_df){
    return(output_df)
  }
  if (return_most_frequent_value){
    output_df$Item <- as.character(output_df$Item)
    output_df$Frequency <- as.numeric(output_df$Frequency)
    most_freq_item <- output_df[1, "Item"]
    cat("\nReturning most frequent item: ", most_freq_item)
    return(most_freq_item)
  }
}
So if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you could run:
most_common_name <- print_count_of_unique_values(df=df, column_name = "name", return_most_frequent_value = T)
You can create a function:
get_mode <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}
and then do
get_mode(Forbes2000$category)
The reason I created a function is that I have to do this kind of thing very often.
You can use table(Forbes2000$category, useNA="ifany"). This will give you the list of all values that occur in the chosen column and the number of times each value appears in that particular data frame.
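As a small illustration of what useNA = "ifany" changes, here is a sketch on a made-up vector (the values and the names tt_default and tt_with_na are hypothetical): by default table() drops missing values, while useNA = "ifany" adds an NA entry whenever missing values are present.

```r
# Made-up vector with missing values
x <- c("Banking", NA, "Banking", NA, NA)

# Default: NA values are silently dropped from the tabulation
tt_default <- table(x)

# With useNA = "ifany", NA gets its own entry when any NA is present
tt_with_na <- table(x, useNA = "ifany")

tt_default
tt_with_na
```

This matters for the "most repeated value" question: depending on the option, NA itself can turn out to be the most frequent entry.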
The following is the easiest (for me) to read and to remember:
names(which.max(table(Forbes2000$category)))
Extra notes on efficiency: This approach avoids sorting the table entries (finding the max is cheaper than a full sort). The most efficient solution would avoid a full tabulation. You can imagine an Rcpp solution that loops through the source vector and keeps a running tabulation but stops before the end, once the contest is already over. If anyone writes that solution, ping me so I can give you a +1 and edit this answer to reference yours.
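To make the "stop when the contest is over" idea concrete, here is a rough sketch in plain R rather than Rcpp. It is not the Rcpp solution described above and will likely be slower than table() in practice, since R-level loops are slow; it only illustrates the stopping rule. The function name mode_early_stop is hypothetical, NA values are dropped, and values are assumed to be non-empty strings once converted to character.

```r
# Illustrative only: running tabulation with an early exit.
# Assumes non-empty string values; NA values are dropped up front.
mode_early_stop <- function(x) {
  x <- as.character(x[!is.na(x)])
  n <- length(x)
  if (n == 0L) return(NA_character_)
  counts <- new.env(hash = TRUE)   # running count per distinct value
  best <- NA_character_            # current leader
  best_n <- 0L                     # leader's count
  second_n <- 0L                   # highest count among non-leaders
  for (i in seq_len(n)) {
    key <- x[i]
    cnt <- if (exists(key, envir = counts, inherits = FALSE)) {
      get(key, envir = counts, inherits = FALSE) + 1L
    } else {
      1L
    }
    assign(key, cnt, envir = counts)
    if (identical(key, best)) {
      best_n <- cnt
    } else if (cnt > best_n) {
      # New leader: the old leader becomes the best runner-up
      second_n <- best_n
      best <- key
      best_n <- cnt
    } else if (cnt > second_n) {
      second_n <- cnt
    }
    # Stop once no other value can catch up with the remaining elements
    if (best_n - second_n > n - i) break
  }
  best
}

mode_early_stop(c("a", "b", "a", "a", "c"))
```

In case of a tie, this sketch returns whichever tied value reached the top count first, which differs from the alphabetical order that table() would give.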
I suggest Rfast::Table.
Rfast::Table(as.character(Forbes2000$category))
Then you can get the maximum value.
Using the function approach from @Malvika makes it easy to apply across a table and get these values for every column:
# Create mode functions: the value itself, its count, and its share of rows
get_mode_name <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}
get_mode_value <- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1]))
}
get_mode_pct <- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1])/length(x))
}
#Identify character columns
type_table <- sapply(table_name, class)
#create vector numeric and character types
num_table <- (unname(type_table) == "numeric")
char_table <- (unname(type_table) == "character")
#View the modes of character columns
mode_name <- apply(table_name[,char_table], 2, function(x) get_mode_name(x))
mode_value <- apply(table_name[,char_table], 2, function(x) get_mode_value(x))
mode_pct <- apply(table_name[,char_table], 2, function(x) get_mode_pct(x))
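As a self-contained illustration of the same idea, here is a sketch on a tiny made-up data frame. The data frame df, its values, and the helper names are hypothetical, and it uses vapply() over the columns instead of apply(), which is a swapped-in choice that avoids converting the data frame to a matrix and still works when only one character column is selected.

```r
# Hypothetical mode helper, same idea as get_mode_name above
get_mode_name <- function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
}

# Made-up data frame with two character columns and one numeric column
df <- data.frame(
  country  = c("US", "US", "UK"),
  category = c("Banking", "Insurance", "Insurance"),
  sales    = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Pick the character columns, then compute the mode of each one
is_char <- vapply(df, is.character, logical(1))
modes <- vapply(df[is_char], get_mode_name, character(1))
modes
```

The result is a named character vector with one mode per character column.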