
Doing counts for a column of a dataframe in R

I have a dataframe "samp" with a column (let's call it "rating") that takes one of several values (say "good", "medium", or "bad").

I would like to group by several other columns, count the frequency of "good", "medium", and "bad" within each group, and report those frequencies in new columns. (So maybe col1 is movie year and col2 is genre, and then there should be three more columns telling you how many of each type of rating there were for each year and genre.)

 ddply(samp,c("col1","col2"), summarize, 
       good=table(samp$rating)["good"],
       medium=table(samp$rating)["medium"],
       bad=table(samp$rating)["bad"])

The problem is (I think) that the functions I'm defining are not in terms of the groups ddply is outputting; they are just constant functions of samp. How can I define the functions here so that they're functions of the groups?

I tried using an anonymous function:

 ddply(samp,c("col1","col2"), summarize, 
       good=function(df)table(df$rating)["good"],
       medium=function(df)table(df$rating)["medium"],
       bad=function(df)table(df$rating)["bad"])

I can never get it working, though. The error I've gotten most often from this is:

 Error in output[[var]][rng] <- df[[var]] : 
 incompatible types (from closure to logical) in subassignment type fix

So lay it on me. What's the ridiculously simple solution that did not turn up while I blundered around trying 948506 combinations of ddply and table? Thank you.

Just remove all instances of samp$ inside ddply and it will work:

ddply(samp,c("col1","col2"), summarize, 
  good=table(rating)["good"],
  medium=table(rating)["medium"],
  bad=table(rating)["bad"])
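One caveat with the table() version, as far as I can tell: if rating is a character column (the data.frame() default since R 4.0.0), a group that contains no "good" rows has no "good" entry in its table, so the lookup gives NA rather than 0. A minimal base-R sketch, using a made-up rating vector and assuming the three levels named in the question:

```r
# Hypothetical group that happens to contain no "good" rating
rating <- c("bad", "medium")

# On a character vector, table() only counts the values that occur,
# so looking up a missing name yields NA
table(rating)["good"]

# Assumption: the full set of levels is known in advance.
# Declaring them on a factor keeps the zero counts.
rating_f <- factor(rating, levels = c("good", "medium", "bad"))
table(rating_f)["good"]
```

Converting the column to a factor with explicit levels before calling ddply, or counting with sum(rating == "good"), should sidestep the NA.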

Generic data:

samp <- data.frame(rating=c("bad","medium","good","bad","medium","good"),
                   col1=c(2007,2010,2007,2009,2010,2010),
                   col2=c("fiction","fiction","fiction","drama","drama","drama"))

Code (you shouldn't use samp$ before the column names):

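library(plyr)  # provides ddply() and summarize()

# Bare column names are evaluated within each (col1, col2) group
ddply(samp, c("col1", "col2"), summarize,
      good = sum(rating == "good"),
      medium = sum(rating == "medium"),
      bad = sum(rating == "bad"))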

Output:

  col1    col2 good medium bad
1 2007 fiction    1      0   1
2 2009   drama    0      0   1
3 2010   drama    1      1   0
4 2010 fiction    0      1   0
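To see why dropping samp$ matters, here is a rough base-R sketch of the same idea. split() and with() stand in for what ddply and summarize do; this is an illustration only, not plyr's actual implementation. A bare rating is looked up in the current group's piece, while samp$rating always refers to the whole data frame:

```r
samp <- data.frame(rating = c("bad", "medium", "good", "bad", "medium", "good"),
                   col1 = c(2007, 2010, 2007, 2009, 2010, 2010),
                   col2 = c("fiction", "fiction", "fiction", "drama", "drama", "drama"))

# One data frame per (col1, col2) combination that actually occurs
groups <- split(samp, list(samp$col1, samp$col2), drop = TRUE)

# Bare `rating` is resolved inside each piece: a per-group count
per_group <- sapply(groups, function(piece) with(piece, sum(rating == "good")))

# `samp$rating` ignores the piece: every group gets 2, the total over all of samp
whole <- sapply(groups, function(piece) sum(samp$rating == "good"))

per_group
whole
```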

