Aggregating mixed data by factor column

For the past week I have been trying to aggregate a dataset in R that consists of weight measurements taken in different months, accompanied by a large number of background variables.

I have read many questions on this topic (e.g. "R aggregate data by defining grouping" and "How to aggregate count of unique values of categorical variables in R"), but they all seem either to work with only one type of data or to be interested in only one column. Specifically, the question "Recoding categorical variables to the most common value" deals with almost exactly the same problem, but the proposed answer only solves it for the categorical data and does not cover the numeric data. My data consist of both factors (categorical and ordinal) and numeric data.

The reproducible example is:

IDnumber <- c("1", "1", "1", "2", "2", "3", "3", "3")
Gender <- c("Male", "Male", "Male", "Female", "Female", "Female", "Female",  "Female")
Weight <- c(80, 82, 82, 70, 66, 54, 50, 52)
LikesSoda <- c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", NA)
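# stringsAsFactors = TRUE keeps the string columns as factors (no longer the default since R 4.0.0)
df <- data.frame(IDnumber, Gender, Weight, LikesSoda, stringsAsFactors = TRUE)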

My output dataframe would take the mean of each numeric column and the most frequent level of each factor column. In the example this would look as follows:

IDnumber <- c("1", "2", "3")
Gender <- c("Male", "Female", "Female")
Weight <- c(81.33, 68, 52)  # per-ID means of Weight (81.33 is rounded)
LikesSoda <- c("No", "Yes", "Yes")
output = data.frame(IDnumber, Gender, Weight, LikesSoda)

So far I've tried splitting the dataframe into a factor dataframe and a numeric dataframe and running aggregate on each with a different function (mean for the numeric columns, but I've not been able to find a working function for the categorical data). The other option is to use dplyr code along the lines of df %>% group_by(IDnumber) %>% summarise( transformation for each variable ), but that requires me to specify manually how to handle each column. Since I have over 2500 columns, this does not seem like a workable solution.

You could write your own functions and then use lapply. First, write a function to find the most frequent level in a factor variable:

getmode <- function(v) {
  levels(v)[which.max(table(v))]
}
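
As a quick check of how this behaves, the sketch below applies getmode to two small factors. The vectors soda and soda_na are made up for illustration and are not part of the question's data. Because table() excludes NA by default, missing values do not count towards the mode.

```r
# Same definition as above, repeated so this snippet runs on its own
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Made-up example vectors (hypothetical, for illustration only)
soda <- factor(c("Yes", "No", "No"))
getmode(soda)      # "No"

# NA is ignored because table() drops it by default
soda_na <- factor(c("Yes", "Yes", NA))
getmode(soda_na)   # "Yes"
```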

Then write a function to return either the mean or the mode, depending on the type of variable passed to it:

my_summary <- function(x, id, ...){
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }  
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }  
}

Finally, use lapply to calculate the summaries:

data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes
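
One caveat: since R 4.0.0, data.frame() no longer converts strings to factors by default, so columns that arrive as character vectors would match neither branch of my_summary and return NULL. A possible variant, offered as a sketch rather than as part of the original answer, converts character columns to factors first. The name my_summary2 and the na.rm = TRUE argument are assumptions added here for illustration.

```r
# getmode as defined above, repeated so this snippet runs on its own
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Hypothetical variant of my_summary that also accepts character columns
my_summary2 <- function(x, id) {
  if (is.character(x)) x <- factor(x)   # treat strings as categorical
  if (is.numeric(x)) return(tapply(x, id, mean, na.rm = TRUE))
  if (is.factor(x))  return(tapply(x, id, getmode))
  NULL                                  # other column types are skipped
}

# The question's sample data, with strings deliberately left as character
df <- data.frame(
  IDnumber  = c("1", "1", "1", "2", "2", "3", "3", "3"),
  Gender    = c("Male", "Male", "Male", "Female", "Female",
                "Female", "Female", "Female"),
  Weight    = c(80, 82, 82, 70, 66, 54, 50, 52),
  LikesSoda = c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", NA),
  stringsAsFactors = FALSE
)

result <- data.frame(lapply(df, my_summary2, id = df$IDnumber))
result
```

Whether missing weights should be dropped with na.rm = TRUE or should make the group mean NA depends on the data, so that argument may need to be removed.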

If two or more levels of a factor share the same maximum frequency, which.max will just return the first one. I understand from your comment that you just want to know how many of them there are, so one option might be to amend the getmode function slightly so that it adds an asterisk to the level when there is a tie:

getmode <- function(v) {
  tab <- table(v)
  if (sum(tab %in% max(tab)) > 1)  return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

(Changing your sample data so there is one Female and one Male with IDnumber == "2")

data.frame(lapply(df, my_summary, id = df$IDnumber))

  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes
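
For reference, the tie can be reproduced on the Gender column alone. Which row of IDnumber "2" was changed is not stated above, so the choice here (the first row of that group) is an assumption, as are the variable names id, gender and modes.

```r
# Tie-aware getmode as defined above, repeated so this snippet runs on its own
getmode <- function(v) {
  tab <- table(v)
  if (sum(tab %in% max(tab)) > 1) return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

# Sample data with one Male and one Female for IDnumber "2" (assumed change)
id     <- c("1", "1", "1", "2", "2", "3", "3", "3")
gender <- factor(c("Male", "Male", "Male", "Male", "Female",
                   "Female", "Female", "Female"))

# Most frequent level per ID; ties are flagged with a trailing asterisk
modes <- tapply(gender, id, getmode)
modes
```

Because the levels are sorted alphabetically, "Female" comes first and is the level that which.max reports for the tied group.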

I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.
