How do I filter a data.frame in R by a categorical variable?
Just learning R. Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for usage?
str(ex0331)
'data.frame': 36 obs. of 2 variables:
$ Iron : num 0.71 1.66 2.01 2.16 2.42 ...
$ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...
Basically, I need to be able to operate on the two factor levels separately; i.e. I need the ability to individually determine length/mean/sd/etc. of the Iron retention rate by Supplement type (Fe3 or Fe4).
What's the easiest way to accomplish this?
I'm aware of the by() command. For example, the following gets some of what I need:
by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
Iron Supplement
Min. :0.710 Fe3:18
1st Qu.:2.420 Fe4: 0
Median :3.475
Mean :3.699
3rd Qu.:4.472
Max. :8.240
------------------------------------------------------------
ex0331$Supplement: Fe4
Iron Supplement
Min. : 2.200 Fe3: 0
1st Qu.: 3.892 Fe4:18
Median : 5.750
Mean : 5.937
3rd Qu.: 6.970
Max. :12.450
But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.
I'd recommend using the ddply function from the plyr package; detailed documentation is available online:
> require(plyr)
> ddply( ex0331, .(Supplement), summarise,
mean = mean(Iron),
sd = sd(Iron),
len = length(Iron))
Supplement mean sd len
1 Fe3 -0.3749169 0.2827360 4
2 Fe4 0.1953116 0.7128129 6
Update. To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:
> transform(ex0331, LogIron = log(Iron))
Iron Supplement LogIron
1 0.07185141 Fe3 -2.63315498
2 1.10367297 Fe3 0.09864368
3 0.48592428 Fe3 -0.72170246
4 0.20286918 Fe3 -1.59519393
5 0.80830682 Fe4 -0.21281357
Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:
> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
Supplement meanLog
1 Fe3 -1.0062304
2 Fe4 0.2791507
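If installing plyr is not an option, roughly the same per-group summary can be sketched in base R with aggregate(). This is only an illustration: the data frame `ex` below is a hypothetical stand-in generated with random numbers (the real ex0331 values are not reproduced here), and the column names of the result come from how data.frame() flattens a matrix column.

```r
# Hypothetical stand-in data, NOT the real ex0331 measurements.
set.seed(1)
ex <- data.frame(
  Iron = abs(rnorm(36)) + 0.1,
  Supplement = factor(rep(c("Fe3", "Fe4"), each = 18))
)

# Compute several statistics of Iron for each Supplement level.
# FUN returns a named vector, so aggregate() stores the result
# as a matrix column called "Iron".
stats <- aggregate(Iron ~ Supplement, data = ex,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), len = length(x)))

# Flatten the matrix column into ordinary columns
# (Iron.mean, Iron.sd, Iron.len).
stats <- do.call(data.frame, stats)
print(stats)
```

The result has one row per Supplement level, much like the ddply output above, and can be indexed like any other data.frame.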
You can get a subset of your data by indexing or using subset:
ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))
subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")
ex0331[ex0331$supplement=="Fe3",]
Or at once with split, resulting in a list:
split(ex0331,ex0331$supplement)
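As a rough sketch of what can be done with that list: each element is an ordinary data.frame, so it can be pulled out by name or passed through sapply() to run the same function on every group. The data frame `ex` and the variable names `groups` and `fe3` are made up for this illustration.

```r
# Hypothetical example data with the same shape as above.
set.seed(1)
ex <- data.frame(iron = rnorm(36), supplement = rep(c("Fe3", "Fe4"), 18))

# One data.frame per supplement level, stored in a named list.
groups <- split(ex, ex$supplement)

# Pull out a single group as its own data.frame.
fe3 <- groups[["Fe3"]]

# Apply the same function to every group.
group_means <- sapply(groups, function(g) mean(g$iron))
group_sizes <- sapply(groups, nrow)
print(group_means)
print(group_sizes)
```

Because each piece is a full data.frame, anything that works on the whole data set (plotting, summary(), log()) should work on `fe3` in the same way.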
Another thing you can do is use tapply to split by a factor and then apply a function to each group:
tapply(ex0331$iron,ex0331$supplement,mean)
Fe3 Fe4
-0.15443861 -0.01308835
The plyr package can also be used, which has loads of useful functions. For example:
library(plyr)
daply(ex0331,.(supplement),function(x)mean(x$iron))
Fe3 Fe4
-0.15443861 -0.01308835
In response to the edited question, you could get the log of iron per supplement with:
ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))
tapply(ex0331$iron,ex0331$supplement,log)
Or with plyr:
library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))
Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example though.
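One possibly simpler route, offered only as a sketch: log() is vectorised, so the log can be taken on the whole column first and the result split by group afterwards, with no wrapper function at all. The data frame `ex` and the name `log_by_group` are hypothetical and exist only for this example.

```r
# Hypothetical example data; abs() keeps the values positive so log() is defined.
set.seed(1)
ex <- data.frame(iron = abs(rnorm(36)), supplement = rep(c("Fe3", "Fe4"), 18))

# Take the log of every value, then split the resulting vector by group.
log_by_group <- split(log(ex$iron), ex$supplement)
str(log_by_group)
```

This gives a plain named list with one numeric vector per supplement, which should hold the same values as the tapply() version above.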