
Calculate the mean (and weighted mean) of several columns within a dataframe

Summary

Given a dataframe in which the columns are variables (all numeric except one, which is a factor) and the rows are observations, I would like to create a new column with the mean of all numeric columns, plus another one with a weighted mean of all numeric columns.

I have found quite a few approaches that apparently solve this problem (using dplyr, lapply, data.table...), but none of them work with wide dataframes (and I am not sure I can convert mine to long format; see below, and please be patient before marking this as a duplicate, as I haven't found any answer to my problem).

Long version:

I have a dataframe in wide format like the one provided below (the original one has more than 1700 observations of 20 variables grouped into 30 neighbourhoods) that is the result of calculating the median of the values of each variable:

df = data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour = c("neighbour1", "neighbour2", "neighbour3")

> df
          X1         X2         X3         X4        X5  neighbour
1  1.0384405  0.6116994 -0.2075835  0.3206011 1.3855455 neighbour1
2 -0.5115649 -0.7722500  0.8374265 -1.3697758 0.1690452 neighbour2
3  1.0145282  0.6809156 -0.2918737  0.2912297 1.0689213 neighbour3

I would like to create

  • 1) a column named mean that is the mean of all numeric values (all columns but neighbour) and
  • 2) a wmean column which is the weighted mean of all numeric columns, where the weights are provided by the following vector: weight = c(.25, .05, .3, .3, .3)

My first attempt was to use dplyr::mutate to create those columns, but I haven't succeeded, most likely because I'm doing it wrong (and if I can't get a regular mean to work, I have no clue how to compute a weighted mean):

df = df %>%
  mutate(mean = mean(select(-neighbour)))
Error in mutate_impl(.data, dots) : 
  invalid argument to unary operator
> df = df %>%
+   mutate(mean = mean())
Error in mutate_impl(.data, dots) : 
  argument "x" is missing, with no default
> df = df %>%
+   mutate(mean = mean(is.numeric()))
Error in mutate_impl(.data, dots) : 
  0 arguments passed to 'is.numeric' which requires 1
> 

I also tried mutate_each, but I assume my problem is that I do not know how to pass the right columns to calculate the mean (not to mention that I have no clue about the weighted mean).

From what I have read, there are many ways to create the desired columns:

  • This answer by Carlos Cinelli gives examples using sapply + filter, dplyr and tidyr, but none of these solutions create a new column with the median of each neighbour's observations; they compute the median of each variable's values instead.

  • This answer by @Roland suggests using data.table, but to use it my dataframe would need a column holding the weights (I do not have one, and I'm afraid I wouldn't know how to create it, given that I have more than 1700 observations).

  • This answer by @Bob uses apply to create a mean of several columns (that's close to what I'm looking for!), but I still have no clue how to A) exclude the neighbour column, as otherwise it will fail, and B) calculate the weighted mean.

Can anyone shed some light on this? I am so confused right now trying to solve this that I can't see the answer.

EDIT: As per @boshek's answer, I have tried converting from wide to long format and then applying summarise_each, but that hasn't succeeded either:

df = df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour, variable) %>%
  summarise_each(., funs=mean)

Ok - so you want means ACROSS the row?

I'd use gather from tidyr, then merge the result back with your original data:

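library(dplyr)
library(tidyr)  # gather() comes from tidyr

df.mean <- df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour) %>%
  # with one row per neighbour, value is ordered X1..X5, matching the weight vector
  summarise(mean_value=mean(value), wmean_value=weighted.mean(value, w=weight))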

df.comb <- df %>%
  full_join(.,df.mean, by=c("neighbour"))

There are a few ways to skin this cat, but this is one.

Is this what you wanted?

df$mean <- apply(df[1:5], 1, mean)
df$wt.mean <- apply(df[1:5], 1, weighted.mean, weight)
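The question also asked how to leave out the neighbour column without hard-coding positions such as 1:5. A minimal sketch, assuming every numeric column should enter the mean; the toy data, the weight values and the num_cols name are made up for illustration and are not from the original post:

```r
# Toy data shaped like the question's df (values made up for illustration)
df <- data.frame(
  X1 = c(1, 2, 3),
  X2 = c(4, 5, 6),
  X3 = c(7, 8, 9),
  neighbour = c("neighbour1", "neighbour2", "neighbour3")
)
weight <- c(0.5, 0.3, 0.2)  # hypothetical weights, one per numeric column

# Store the numeric column NAMES up front, so that columns added later
# (mean, wt.mean) are not picked up by the second call
num_cols <- names(df)[sapply(df, is.numeric)]

df$mean    <- apply(df[num_cols], 1, mean)
df$wt.mean <- apply(df[num_cols], 1, weighted.mean, w = weight)
```

Indexing by name rather than by a logical vector matters here: a logical index computed before the new columns exist would be recycled once df gains columns, and could silently include them.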

I think the rowMeans() function in base may be your best bet.

df$mean <- rowMeans(dplyr::select(df, starts_with("X")))

The weighted mean may be more difficult. I couldn't find a quick and clean way to do it, but here's an option that works:

# define a function that calculates a weighted mean
wmean <- function(x, weight){
  stopifnot(length(x) == length(weight))
  if(!isTRUE(all.equal(sum(weight), 1))) {  # tolerant check; != is fragile for floating-point sums
    message("Rescaling weights to sum to 1")
    weight <- weight/sum(weight)
  }
  wx <- sum(x * weight)
  return(wx)
}
# apply that function row by row to the X columns in df
df$wmean <- apply(X=dplyr::select(df, starts_with("X")), MARGIN=1, FUN=wmean, weight = weight)
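If the row-by-row apply gets slow on a larger dataframe, the same weighted mean can probably be written as a single matrix product. This is only a sketch under the assumption that there are no missing values (a single NA makes the whole row NA here); the toy data, the weights and the names m and num_cols are hypothetical:

```r
# Toy data (values made up for illustration)
df <- data.frame(
  X1 = c(1, 2, 3),
  X2 = c(4, 5, 6),
  X3 = c(7, 8, 9),
  neighbour = c("neighbour1", "neighbour2", "neighbour3")
)
weight <- c(2, 1, 1)  # hypothetical weights, deliberately not summing to 1

# Numeric columns only, as a matrix (rows = observations)
num_cols <- names(df)[sapply(df, is.numeric)]
m <- as.matrix(df[num_cols])
stopifnot(ncol(m) == length(weight))

# Each row's weighted sum is one entry of m %*% weight;
# dividing by sum(weight) rescales the weights to sum to 1
df$wmean <- as.vector(m %*% weight) / sum(weight)
```

On this toy data the result should agree with applying weighted.mean to each row with the same weights.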

I know I'm a bit late posting this, but I was looking for a solution to a similar problem and found rowWeightedMeans in the matrixStats library, which also supports na.rm. You only need to convert the data to a matrix; it works as follows:

library(matrixStats)
df$wmean <- rowWeightedMeans(as.matrix(df[ , c('X1', 'X2', 'X3', 'X4', 'X5')]), w = weight)

This worked perfectly for me and, as mentioned, it has the extra benefit of supporting na.rm = TRUE, which I needed.
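To see what dropping missing values means for a weighted mean, here is a base R sketch using weighted.mean with na.rm = TRUE, which discards each NA together with its weight. It is not the matrixStats implementation, only an illustration of the idea; the toy data and weights are made up:

```r
# Toy data with one missing value (values made up for illustration)
df <- data.frame(X1 = c(1, 2), X2 = c(NA, 4), X3 = c(3, 6))
weight <- c(0.2, 0.3, 0.5)  # hypothetical weights, one per column

m <- as.matrix(df[c("X1", "X2", "X3")])

# Without na.rm, any row containing an NA yields NA
wmean_strict <- apply(m, 1, weighted.mean, w = weight)

# With na.rm = TRUE, the NA and its weight are both dropped,
# and the remaining weights are renormalised by weighted.mean()
df$wmean <- apply(m, 1, weighted.mean, w = weight, na.rm = TRUE)
```

For the first row this gives (0.2 * 1 + 0.5 * 3) / (0.2 + 0.5), since the weight 0.3 belonging to the missing X2 is left out of both the numerator and the denominator.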
