Group by columns, then compute mean and sd of every other column in R
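For example, consider the well-known iris dataset. I want to group by Species and then compute the mean and sd of the petal/sepal length/width measurements. I know this is a split-apply-combine problem, but I'm not sure how to proceed from there.
What I could come up with: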
require(plyr)
x <- ddply(iris, .(Species), summarise,
Sepal.Length.Mean = mean(Sepal.Length),
Sepal.Length.Sd = sd(Sepal.Length),
Sepal.Width.Mean = mean(Sepal.Width),
Sepal.Width.Sd = sd(Sepal.Width),
Petal.Length.Mean = mean(Petal.Length),
Petal.Length.Sd = sd(Petal.Length),
Petal.Width.Mean = mean(Petal.Width),
Petal.Width.Sd = sd(Petal.Width))
Species Sepal.Length.Mean Sepal.Length.Sd Sepal.Width.Mean Sepal.Width.Sd
1 setosa 5.006 0.3524897 3.428 0.3790644
2 versicolor 5.936 0.5161711 2.770 0.3137983
3 virginica 6.588 0.6358796 2.974 0.3224966
Petal.Length.Mean Petal.Length.Sd Petal.Width.Mean Petal.Width.Sd
1 1.462 0.1736640 0.246 0.1053856
2 4.260 0.4699110 1.326 0.1977527
3 5.552 0.5518947 2.026 0.2746501
Desired output:
z <- data.frame(setosa = c(5.006, 0.3524897, 3.428, 0.3790644,
1.462, 0.1736640, 0.246, 0.1053856),
versicolor = c(5.936, 0.5161711, 2.770, 0.3137983,
4.260, 0.4699110, 1.326, 0.1977527),
virginica = c(6.588, 0.6358796, 2.974, 0.3224966,
5.552, 0.5518947, 2.026, 0.2746501))
rownames(z) <- c('Sepal.Length.Mean', 'Sepal.Length.Sd',
'Sepal.Width.Mean', 'Sepal.Width.Sd',
'Petal.Length.Mean', 'Petal.Length.Sd',
'Petal.Width.Mean', 'Petal.Width.Sd')
setosa versicolor virginica
Sepal.Length.Mean 5.0060000 5.9360000 6.5880000
Sepal.Length.Sd 0.3524897 0.5161711 0.6358796
Sepal.Width.Mean 3.4280000 2.7700000 2.9740000
Sepal.Width.Sd 0.3790644 0.3137983 0.3224966
Petal.Length.Mean 1.4620000 4.2600000 5.5520000
Petal.Length.Sd 0.1736640 0.4699110 0.5518947
Petal.Width.Mean 0.2460000 1.3260000 2.0260000
Petal.Width.Sd 0.1053856 0.1977527 0.2746501
We can try dplyr:
library(dplyr)
res <- iris %>%
group_by(Species) %>%
summarise_each(funs(mean, sd))
`colnames<-`(t(res[-1]), as.character(res$Species))
# setosa versicolor virginica
#Sepal.Length_mean 5.0060000 5.9360000 6.5880000
#Sepal.Width_mean 3.4280000 2.7700000 2.9740000
#Petal.Length_mean 1.4620000 4.2600000 5.5520000
#Petal.Width_mean 0.2460000 1.3260000 2.0260000
#Sepal.Length_sd 0.3524897 0.5161711 0.6358796
#Sepal.Width_sd 0.3790644 0.3137983 0.3224966
#Petal.Length_sd 0.1736640 0.4699110 0.5518947
#Petal.Width_sd 0.1053856 0.1977527 0.2746501
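In newer dplyr releases, summarise_each() and funs() are deprecated. The following is a minimal sketch of the same idea using across(), assuming dplyr 1.0.0 or later; the name out is only illustrative. Note that across() interleaves the columns (mean and sd per variable) rather than listing all means first, so the row order differs from the output above.

```r
library(dplyr)

# Compute mean and sd of every non-grouping column, per species.
# Result columns are named "<column>_<function>", e.g. Sepal.Length_mean.
res <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

# Transpose so that statistics are rows and species are columns.
out <- t(as.matrix(res[-1]))
colnames(out) <- as.character(res$Species)
out
```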
Or, as @Steven Beaupre mentioned in the comments, the output can be obtained by reshaping with spread:
library(tidyr)
iris %>%
group_by(Species) %>%
summarise_each(funs(mean, sd)) %>%
gather(key, value, -Species) %>%
spread(Species, value)
Here is the traditional plyr approach. It uses colwise to compute the summary statistics for all columns.
means <- ddply(iris, .(Species), colwise(mean))
sds <- ddply(iris, .(Species), colwise(sd))
merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))
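For comparison, the split-apply-combine steps that colwise hides can be sketched in base R without any packages. This is only an illustration, not part of the plyr answer; the names pieces and stats are made up for the example.

```r
# Split: one data frame of the four numeric columns per species.
pieces <- split(iris[, -5], iris$Species)

# Apply: mean and sd of every column within each piece.
# Combine: sapply() binds the named vectors into a matrix,
# one row per statistic and one column per species.
stats <- sapply(pieces, function(d) {
  m <- sapply(d, mean)
  s <- sapply(d, sd)
  c(setNames(m, paste0(names(m), ".mean")),
    setNames(s, paste0(names(s), ".sd")))
})
stats
```

The result is a plain numeric matrix with the statistic names stored as row names, which is already the shape asked for in the question.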
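If you want to use data.table for performance reasons, you can try this (don't be scared: there are more comments than code ;-) I tried to optimize all performance-critical points.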
library(data.table)
dt <- as.data.table(iris)
# Helper function similar to "colwise" of package "plyr":
# Apply a function "func" to each column of the data.table "data"
# and append the "suffix" string to the result column name.
colwise.dt <- function( data, func, suffix )
{
result <- lapply(data, func) # apply the function to each column of the data table
setDT(result) # convert the result list into a data table efficiently ("by ref")
setnames(result, names(result), paste0(names(result), suffix)) # append suffix to each column name efficiently ("by ref"). "setnames" requires a data.table
}
wide.result <- dt[, c(colwise.dt(.SD, mean, ".mean"), colwise.dt(.SD, sd, ".sd")), by=.(Species)]
# Note: .SD is a data.table containing the subset of dt's data for each group (Species), excluding any columns used in "by" (here: Species column)
# Now transpose the result
long.result <- melt(wide.result, id.vars="Species")
# Now transform into one column per group
final.result <- dcast(long.result, variable ~ Species)
wide.result is:
Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
1: setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644 0.1736640 0.1053856
2: versicolor 5.936 2.770 4.260 1.326 0.5161711 0.3137983 0.4699110 0.1977527
3: virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966 0.5518947 0.2746501
long.result is:
Species variable value
1: setosa Sepal.Length.mean 5.0060000
2: versicolor Sepal.Length.mean 5.9360000
3: virginica Sepal.Length.mean 6.5880000
4: setosa Sepal.Width.mean 3.4280000
5: versicolor Sepal.Width.mean 2.7700000
6: virginica Sepal.Width.mean 2.9740000
7: setosa Petal.Length.mean 1.4620000
8: versicolor Petal.Length.mean 4.2600000
9: virginica Petal.Length.mean 5.5520000
10: setosa Petal.Width.mean 0.2460000
11: versicolor Petal.Width.mean 1.3260000
12: virginica Petal.Width.mean 2.0260000
13: setosa Sepal.Length.sd 0.3524897
14: versicolor Sepal.Length.sd 0.5161711
15: virginica Sepal.Length.sd 0.6358796
16: setosa Sepal.Width.sd 0.3790644
17: versicolor Sepal.Width.sd 0.3137983
18: virginica Sepal.Width.sd 0.3224966
19: setosa Petal.Length.sd 0.1736640
20: versicolor Petal.Length.sd 0.4699110
21: virginica Petal.Length.sd 0.5518947
22: setosa Petal.Width.sd 0.1053856
23: versicolor Petal.Width.sd 0.1977527
24: virginica Petal.Width.sd 0.2746501
final.result is:
variable setosa versicolor virginica
1: Sepal.Length.mean 5.0060000 5.9360000 6.5880000
2: Sepal.Width.mean 3.4280000 2.7700000 2.9740000
3: Petal.Length.mean 1.4620000 4.2600000 5.5520000
4: Petal.Width.mean 0.2460000 1.3260000 2.0260000
5: Sepal.Length.sd 0.3524897 0.5161711 0.6358796
6: Sepal.Width.sd 0.3790644 0.3137983 0.3224966
7: Petal.Length.sd 0.1736640 0.4699110 0.5518947
8: Petal.Width.sd 0.1053856 0.1977527 0.2746501
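The only difference from the desired output is that final.result contains the value names in a first column named variable instead of storing them in the row names. This can be fixed by setting the row names from the first column and then dropping that column...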
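A minimal sketch of that last step, assuming a plain data.frame is acceptable as the final result (a data.table does not keep row names, so the table is converted first). The pipeline is rebuilt compactly here so the snippet runs on its own; wide and res.df are illustrative names only.

```r
library(data.table)

dt <- as.data.table(iris)

# Rebuild a table shaped like final.result: one row per statistic,
# one column per species.
wide <- dt[, c(setNames(lapply(.SD, mean), paste0(names(.SD), ".mean")),
               setNames(lapply(.SD, sd),   paste0(names(.SD), ".sd"))),
           by = Species]
final.result <- dcast(melt(wide, id.vars = "Species"),
                      variable ~ Species, value.var = "value")

# Move the "variable" column into the row names of a data.frame.
res.df <- as.data.frame(final.result)
rownames(res.df) <- as.character(res.df$variable)
res.df$variable <- NULL
res.df
```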
Inspired by the answers, I found a solution that uses only dplyr and tidyr functions.
require(tidyr)
require(dplyr)
x <- iris %>%
gather(var, value, -Species)
print(tbl_df(x))
# Compute the mean and sd for each dimension
x <- x %>%
group_by(Species, var) %>%
summarise(mean = mean(value), sd = sd(value)) %>%
ungroup
print(tbl_df(x))
# Convert the data frame from wide form to long form
x <- x %>%
gather(stat, value, mean:sd)
print(tbl_df(x))
# Combine the variables "var" and "stat" into a single variable
x <- x %>%
unite(var, var, stat, sep = '.')
print(tbl_df(x))
# Convert the data frame from long form to wide form
x <- x %>%
spread(Species, value)
print(tbl_df(x))
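In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The steps above can be sketched as a single pipeline with those functions, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument); this is an equivalent rewrite for illustration, not the original poster's code.

```r
library(dplyr)
library(tidyr)

x <- iris %>%
  # Long form: one row per (Species, measurement name, value).
  pivot_longer(-Species, names_to = "var", values_to = "value") %>%
  # Mean and sd for each species and measurement.
  group_by(Species, var) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  # Stack the two statistics, then merge "var" and "stat" into one label.
  pivot_longer(c(mean, sd), names_to = "stat", values_to = "value") %>%
  unite(var, var, stat, sep = ".") %>%
  # Wide form: one column per species.
  pivot_wider(names_from = Species, values_from = value)
x
```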