
How to retrieve the most repeated value in a column present in a data frame

I am trying to retrieve the most repeated value in a particular column of a data frame. My sample data and code are below.

data("Forbes2000", package = "HSAUR")
head(Forbes2000)


  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55

As per my sample data, I need to return the most repeated category, which is Insurance.

subset(Forbes2000, country == "Bermuda")
tail(names(sort(table(Forbes2000$category))), 1)
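To see how the subset step and the tabulation fit together, here is a minimal sketch on a toy data frame. The `companies` data and its values are made up for illustration and are not taken from Forbes2000; the point is only that the mode of the filtered rows can differ from the mode of the whole column.

```r
# Hypothetical toy data standing in for Forbes2000
companies <- data.frame(
  country  = c("Bermuda", "Bermuda", "Bermuda",
               "United States", "United States", "United States"),
  category = c("Insurance", "Insurance", "Banking",
               "Banking", "Banking", "Banking"),
  stringsAsFactors = TRUE
)

# Restrict to one country first, then tabulate the category column
bermuda <- subset(companies, country == "Bermuda")
mode_bermuda <- tail(names(sort(table(bermuda$category))), 1)

# Without the subset, the answer comes from every row
mode_all <- tail(names(sort(table(companies$category))), 1)

mode_bermuda
mode_all
```

Here `mode_bermuda` is "Insurance" while `mode_all` is "Banking", so the tabulation should be applied to the subset rather than to the full data frame when only one country is of interest.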

If two or more categories may be tied for most frequent, use something like this:

x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance" 

Another way uses the data.table package, which is faster for large data sets:

set.seed(1)
x=sample(seq(1,100), 5000000, replace = TRUE)

Method 1 (solution proposed above)

start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 4.883488 secs

Method 2 (data.table)

library(data.table)  # load before timing so package start-up is not measured

start.time <- Sys.time()
ds <- data.table( x )
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]

most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 0.328033 secs
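As a side note on the measurement itself, the manual `Sys.time()` bookkeeping above can be replaced by base R's `system.time()`. The sketch below is only an illustration: it uses a much smaller vector than the 5,000,000 values in the benchmark, the helper name `mode_via_table` is made up here, and no particular timing should be expected from it.

```r
set.seed(1)
# Smaller sample than in the benchmark above, to keep the sketch quick
x <- sample(seq(1, 100), 100000, replace = TRUE)

# Hypothetical helper wrapping method 1 so it can be timed as one call
mode_via_table <- function(v) {
  tt <- table(v)
  names(tt[tt == max(tt)])
}

# system.time() evaluates the expression and reports user/system/elapsed time
timing <- system.time(result <- mode_via_table(x))

timing[["elapsed"]]
result
```

Wrapping each method in a function like this makes it easier to time both approaches on the same input and to repeat the measurement.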

I know my answer is coming a little late, but I built the following function, which does the job in less than a second for my data frame of more than 50,000 rows:

print_count_of_unique_values <- function(df, column_name, remove_items_with_freq_equal_or_lower_than = 0, return_df = F, 
                                         sort_desc = T, return_most_frequent_value = F)
{
  temp <- df[column_name]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item","Frequency")
  output_df <- output[  output[[2]] > remove_items_with_freq_equal_or_lower_than,  ]

  if (sort_desc){
    output_df <- output_df[order(output_df[[2]], decreasing = T), ]
  }

  cat("\nThis is the (head) count of the unique values in dataframe column '", column_name,"':\n")
  print(head(output_df))

  if (return_df){
    return(output_df)
  }

  if (return_most_frequent_value){
      output_df$Item <- as.character(output_df$Item)
      output_df$Frequency <- as.numeric(output_df$Frequency)
      most_freq_item <- output_df[1, "Item"]
      cat("\nReturning most frequent item: ", most_freq_item)
      return(most_freq_item)
  }
}

So if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you could run:

most_common_name <- print_count_of_unique_values(df=df, column_name = "name", return_most_frequent_value = T)    

You can create a function:

get_mode <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

and then do:

get_mode(Forbes2000$category)

The reason I created a function is that I have to do this kind of thing very often.

You can use table(Forbes2000$category, useNA="ifany"). This will give you the list of all values in the chosen column and the number of times each value occurs in that data frame.
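A small sketch of what `useNA = "ifany"` changes, using a made-up vector rather than the Forbes2000 data: by default `table()` drops missing values, so a column that is mostly `NA` would otherwise report one of its non-missing values as the most frequent.

```r
# Hypothetical values with more NAs than any single category
x <- c("Banking", "Insurance", NA, "Insurance", NA, NA)

# Default: NA values are not counted
counts_default <- table(x)

# With useNA = "ifany", NA gets its own cell when present
counts_with_na <- table(x, useNA = "ifany")

counts_default
counts_with_na

# The most frequent entry, with and without NA taken into account
names(counts_default)[which.max(counts_default)]
names(counts_with_na)[which.max(counts_with_na)]
```

In this toy case the default table reports "Insurance", while the table that includes missing values reports `NA` as the most frequent entry. Which of the two is wanted depends on the analysis.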

The following is the easiest (for me) to read and to remember:

names(which.max(table(Forbes2000$category)))
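One thing to keep in mind with this form is that `which.max()` returns only the first maximum, so ties are resolved silently. The sketch below uses a made-up vector to contrast it with the tie-aware expression from the earlier answer.

```r
# Hypothetical vector in which two categories are tied at two occurrences
x <- c("Insurance", "Insurance", "Food markets", "Food markets", "Capital Goods")
tt <- table(x)

# which.max() keeps only the first maximum (table cells are in sorted order)
first_only <- names(which.max(tt))

# Comparing against max() keeps every tied category
all_tied <- names(tt[tt == max(tt)])

first_only
all_tied
```

Here `first_only` is "Food markets", while `all_tied` contains both "Food markets" and "Insurance".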

Extra notes on efficiency: This approach avoids sorting the table entries (finding the max is cheaper than a full sort). The most efficient solution would avoid a full tabulation. You can imagine an Rcpp solution that loops through the source vector and keeps a running tabulation but stops before the end, when the contest is already over. If anyone writes that solution, ping me so I can give you a +1 and edit this answer to reference your answer.
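The early-stopping idea can be sketched in plain R to show the logic. This is not the Rcpp solution asked for, and an interpreted loop like this will usually be far slower than `table()`. The function name `mode_early_stop` is made up, and the sketch assumes the values can be used as names (no empty strings) and drops `NA`s.

```r
# Illustration only: running tabulation that stops once the leader cannot be caught
mode_early_stop <- function(x) {
  x <- as.character(x[!is.na(x)])   # assumption: NAs are ignored
  n <- length(x)
  counts <- integer(0)              # named running counts, grown as values appear
  for (i in seq_len(n)) {
    key <- x[i]
    counts[key] <- if (key %in% names(counts)) counts[[key]] + 1L else 1L
    leader <- which.max(counts)
    runner_up <- if (length(counts) > 1) max(counts[-leader]) else 0L
    # The contest is over when the lead exceeds the number of unseen elements
    if (counts[[leader]] - runner_up > n - i) {
      return(list(value = names(counts)[leader], stopped_at = i))
    }
  }
  # Reached only for empty input or a tie; which.max() then picks the first leader
  list(value = names(counts)[which.max(counts)], stopped_at = n)
}

res <- mode_early_stop(c("a", "a", "a", "b", "c"))
res$value
res$stopped_at
```

In the usage example the loop stops after the third element: "a" leads by three with only two elements left, so the remaining values cannot change the result.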

I suggest Rfast::Table.

Rfast::Table(as.character(Forbes2000$category))

Then you can get the maximum value.

Using the function option from @Malvika makes it easy to apply across a table and get these values for every column:

#create a mode function
get_mode_name <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_value <- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_pct<- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1])/length(x))
}

#Identify character columns
type_table <- sapply(table_name, class)

#create vector numeric and character types
num_table <- (unname(type_table) == "numeric")
char_table <- (unname(type_table) == "character")

#View the modes of character columns
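# drop = FALSE keeps a data frame even when only one character column is selected
mode_name <- apply(table_name[, char_table, drop = FALSE], 2, function(x) get_mode_name(x))
mode_value <- apply(table_name[, char_table, drop = FALSE], 2, function(x) get_mode_value(x))
mode_pct <- apply(table_name[, char_table, drop = FALSE], 2, function(x) get_mode_pct(x))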
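For a quick check of the per-column idea, here is a self-contained sketch on a toy data frame. The `toy` data is made up, `get_mode_name` is redefined in a shortened form so the block runs on its own, and `vapply()` is used instead of `apply()` as one possible variation rather than as the answer's method.

```r
# Shortened re-definition so the sketch is self-contained
get_mode_name <- function(x) names(sort(table(x), decreasing = TRUE))[1]

# Hypothetical data frame with two character columns and one numeric column
toy <- data.frame(
  country  = c("US", "US", "UK", "US"),
  category = c("Banking", "Insurance", "Insurance", "Insurance"),
  sales    = c(94.71, 76.66, 232.57, 49.01),
  stringsAsFactors = FALSE
)

# Select character columns, then compute one mode per column
is_char <- vapply(toy, is.character, logical(1))
modes <- vapply(toy[is_char], get_mode_name, character(1))

modes
```

The result is a named character vector with one entry per character column, here "US" for `country` and "Insurance" for `category`.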
