Calculate mean for multiple columns in data.frame
Just wondering whether it is possible to calculate the means of multiple columns using only the mean function.
E.g.
mean(iris[,1])
is possible, but not
mean(iris[,1:4])
I also tried:
mean(iris[,c(1:4)])
and got this warning message:
Warning message: In mean.default(iris[, 1:4]) : argument is not numeric or logical: returning NA
I know I can just use lapply(iris[,1:4],mean) or sapply(iris[,1:4],mean).
Try colMeans. The columns must be numeric, though; for larger datasets you can add a test for that:
colMeans(iris[sapply(iris, is.numeric)])
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
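For repeated use, the numeric test and the colMeans call can be wrapped in a small helper. This is only a sketch: the name numeric_col_means and its na.rm argument are illustrative and not part of the original answer, and it assumes a plain data.frame as input.

```r
# Hypothetical helper: means of the numeric columns of a plain data.frame.
numeric_col_means <- function(df, na.rm = FALSE) {
  # vapply always returns a logical vector, one entry per column
  is_num <- vapply(df, is.numeric, logical(1))
  # Keep only the numeric columns, then take the column means
  colMeans(df[is_num], na.rm = na.rm)
}

# datasets::iris is referenced explicitly in case `iris` was reassigned
res <- numeric_col_means(datasets::iris)
print(res)
```

The Species column is dropped before colMeans runs, so the result contains the four measurement columns only.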
Benchmark
dplyr and data.table seem slow here. Perhaps someone can replicate the findings to verify them.
microbenchmark(
plafort = colMeans(big.df[sapply(big.df, is.numeric)]),
Carlos = colMeans(Filter(is.numeric, big.df)),
Cdtable = big.dt[, lapply(.SD, mean)],
Cdplyr = big.df %>% summarise_each(funs(mean))
)
#Unit: milliseconds
#    expr       min        lq     mean    median       uq       max
# plafort  9.862934 10.506778 12.07027 10.699616 11.16404  31.23927
#  Carlos  9.215143  9.557987 11.30063  9.843197 10.21821  65.21379
# Cdtable 57.157250 64.866996 78.72452 67.633433 87.52451 264.60453
#  Cdplyr 62.933293 67.853312 81.77382 71.296555 91.44994 182.36578
Data
m <- matrix(1:1e6, 1000)
m2 <- matrix(rep('a', 1000), ncol=1)
big.df <- as.data.frame(cbind(m2, m), stringsAsFactors=F)
big.df[,-1] <- lapply(big.df[,-1], as.numeric)
big.dt <- as.data.table(big.df)
With sapply + Filter:
sapply(Filter(is.numeric, iris), mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
With dplyr:
library(dplyr)
iris %>% summarise_each(funs(mean))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.843333 3.057333 3.758 1.199333 NA
PS: in dplyr you can now use summarize_if:
iris %>% summarise_if(is.numeric, mean)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.843333 3.057333 3.758 1.199333
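In newer dplyr releases the scoped verbs (summarise_each, summarise_if) have been superseded by across(). A sketch of the equivalent call, assuming dplyr 1.0.0 or later is installed:

```r
library(dplyr)

# datasets::iris is referenced explicitly in case `iris` was reassigned
res <- datasets::iris %>%
  # where(is.numeric) selects the numeric columns; mean is applied to each
  summarise(across(where(is.numeric), mean))
print(res)
```

As with summarise_if, the non-numeric Species column is excluded from the result instead of producing an NA.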
With data.table:
library(data.table)
iris <- data.table(iris)
iris[,lapply(.SD, mean)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.843333 3.057333 3.758 1.199333 NA
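The NA for Species can be avoided by restricting .SD to the numeric columns through .SDcols. A minimal sketch; the names dt and num_cols are illustrative only:

```r
library(data.table)

# datasets::iris is referenced explicitly in case `iris` was reassigned
dt <- as.data.table(datasets::iris)
# Names of the numeric columns
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]
# .SDcols limits .SD to those columns, so mean() never sees Species
res <- dt[, lapply(.SD, mean), .SDcols = num_cols]
print(res)
```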
Your solution above does work when the data is entirely numeric, for example a numeric matrix. Note, however, that mean() then returns a single overall mean of all values rather than one mean per column. See the example below:
a <- c(1,2,3)
mean(a)
b <- c(2,4,6)
mean(b)
d <- c(3,6,9)
mydata <- cbind(b,a,d)
mean(mydata[,1:3])
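To make the behaviour concrete, the sketch below reuses the vectors from the example above and compares mean() with colMeans() on the resulting matrix. The variable names overall and per_col are illustrative.

```r
b <- c(2, 4, 6)
a <- c(1, 2, 3)
d <- c(3, 6, 9)
mydata <- cbind(b, a, d)  # cbind() on vectors returns a numeric matrix

overall <- mean(mydata[, 1:3])      # one number: the mean of all nine values
per_col <- colMeans(mydata[, 1:3])  # one mean per column

print(overall)  # 4
print(per_col)  # b = 4, a = 2, d = 6
```

So mean() runs without a warning on a matrix, but colMeans() is still what gives one mean per column.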