Apply a function on all subsets of a data frame
How can I normalize the values of Sepal.Length by Species?
iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
...
# I have to divide by
tapply(iris$Sepal.Length, iris$Species, max)
setosa versicolor virginica
5.8 7.0 7.9
In other words, I want to divide all values where Species == "setosa" by 5.8, and so on. In the end I want a data frame with normalized values (0..1) in the Sepal.Length column.
Finally, it should return:
iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.8793103 3.5 1.4 0.2 setosa
...
Obviously there are a ton of ways to do this. I like the syntax of ave() (see DWin's answer) or the data.table package best:
library(data.table)
dt <- data.table(iris)
dt[, Sepal.Length:=(Sepal.Length)/max(Sepal.Length), by="Species"]
dt
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 0.8793103 3.5 1.4 0.2 setosa
# 2: 0.8448276 3.0 1.4 0.2 setosa
# 3: 0.8103448 3.2 1.3 0.2 setosa
# 4: 0.7931034 3.1 1.5 0.2 setosa
# 5: 0.8620690 3.6 1.4 0.2 setosa
# ---
# 146: 0.8481013 3.0 5.2 2.3 virginica
# 147: 0.7974684 2.5 5.0 1.9 virginica
# 148: 0.8227848 3.0 5.2 2.0 virginica
# 149: 0.7848101 3.4 5.4 2.3 virginica
# 150: 0.7468354 3.0 5.1 1.8 virginica
df <- data.frame(dt) ## It's possible (but not necessary) to coerce back to
## a plain old data.frame
I'm interpreting your desire to divide by the max values strictly.
One option:
aggregate(iris$Sepal.Length,list(iris$Species),FUN = function(x) x/max(x))
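Note that aggregate() returns one row per group, so this call does not produce a 150-row data frame. A minimal sketch for inspecting the shape of the result (the name res is only for illustration):

```r
# Inspect what aggregate() returns when FUN yields a whole vector per group.
res <- aggregate(iris$Sepal.Length, list(iris$Species),
                 FUN = function(x) x / max(x))
nrow(res)    # one row per species
dim(res$x)   # the scaled values sit in a matrix column, one row per species
```

If the goal is a data frame with the same rows as iris, the ave(), ddply(..., transform, ...) or data.table variants are probably a better fit.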
and another, using ddply from plyr (which scales all the columns at once):
library(plyr)
ddply(iris,.(Species),colwise(function(x){x / max(x)}))
And a variant more like @DWin's ave example, which keeps the other columns the same, but uses ddply:
ddply(iris,.(Species),transform,Sepal.Length = Sepal.Length / max(Sepal.Length))
iris$ratio_to_max <- ave( iris$Sepal.Length, list(iris$Species),
FUN= function(x) x/max(x))
#-------------
> str(iris)
'data.frame': 150 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ratio_to_max: num 0.879 0.845 0.81 0.793 0.862 ...
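A quick sanity check on this approach, as a sketch: it recomputes the ratio on a copy (iris_chk is a hypothetical name, used so the built-in data set stays untouched) and confirms that the largest ratio within each species is 1.

```r
# Recompute the ratio on a copy of iris.
iris_chk <- iris
iris_chk$ratio_to_max <- ave(iris_chk$Sepal.Length, iris_chk$Species,
                             FUN = function(x) x / max(x))

# The per-species maximum of the ratio should be exactly 1.
tapply(iris_chk$ratio_to_max, iris_chk$Species, max)

# First row: 5.1 / 5.8, the value the question asks for.
iris_chk$ratio_to_max[1]
```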
If you wanted to replace the Sepal.Length column you could do so, but I generally avoid such destructive practice until I am really sure I got what I wanted. (And even then I feel guilty.) If you wanted this in separate list "packets", throwing away the original "Sepal.Length" column, you could use split:
spl.iris <- split(iris[-1], iris$Species)
str(spl.iris)
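The split pieces can also be scaled one by one and put back together. The following is only a sketch of that idea in base R; pieces and iris_scaled are hypothetical names, and here Sepal.Length is kept and rescaled rather than dropped.

```r
# Split into per-species data frames, scale Sepal.Length within each piece,
# then reassemble in the original row order.
pieces <- split(iris, iris$Species)
pieces <- lapply(pieces, function(d) {
  d$Sepal.Length <- d$Sepal.Length / max(d$Sepal.Length)
  d
})
iris_scaled <- unsplit(pieces, iris$Species)
head(iris_scaled, 3)
```

unsplit() reverses split() using the same grouping factor, so the rows come back in the order they had in iris.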
I'm sure there are much better plyr, data.table, or even base ways:
# Standardize (center and scale) each numeric column within each species
L1 <- lapply(split(iris[, -5], iris$Species), function(x) apply(x, 2, scale))
# Re-attach the species label to each scaled block
L2 <- lapply(seq_along(L1), function(i) {
  data.frame(Species=names(L1)[i], L1[[i]])
})
do.call(rbind, L2)
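Note that scale() centers each column and divides by its standard deviation, so this variant yields z-scores rather than ratios to the group maximum. A small sketch comparing the two on the setosa Sepal.Length values (x, z and r are illustrative names):

```r
x <- iris$Sepal.Length[iris$Species == "setosa"]

z <- as.vector(scale(x))   # (x - mean(x)) / sd(x)
r <- x / max(x)            # ratio to the group maximum

range(z)   # spans negative and positive values
range(r)   # stays within (0, 1]
```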