How do I filter a data.frame in R by a categorical variable?

Question

I'm just learning R.

Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for use?

str(ex0331)
'data.frame':   36 obs. of  2 variables:
$ Iron      : num  0.71 1.66 2.01 2.16 2.42 ...
$ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...

Basically, I need to be able to operate on the two factor levels separately; i.e., I need to be able to individually determine the length/mean/sd/etc. of the Iron retention rate by Supplement type (Fe3 or Fe4).

What's the easiest way to accomplish this?

I'm aware of the by() command. For example, the following gets some of what I need:

by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
     Iron       Supplement
Min.   :0.710   Fe3:18    
1st Qu.:2.420   Fe4: 0    
Median :3.475             
Mean   :3.699             
3rd Qu.:4.472             
Max.   :8.240             
------------------------------------------------------------ 
ex0331$Supplement: Fe4
     Iron        Supplement
Min.   : 2.200   Fe3: 0    
1st Qu.: 3.892   Fe4:18    
Median : 5.750             
Mean   : 5.937             
3rd Qu.: 6.970             
Max.   :12.450

But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.

Answer 1

I'd recommend the ddply function from the plyr package; detailed documentation is available online:

> require(plyr)
> ddply( ex0331, .(Supplement), summarise, 
         mean = mean(Iron), 
         sd = sd(Iron), 
         len = length(Iron))

  Supplement       mean        sd len
1        Fe3 -0.3749169 0.2827360   4
2        Fe4  0.1953116 0.7128129   6
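
For readers who would rather not install plyr, a similar per-group summary can be sketched in base R with aggregate(). This is only an illustration under assumptions: the data frame below is made-up stand-in data that reuses the question's column names (Iron, Supplement), not the real ex0331 values.

```r
# Hypothetical stand-in data with the same column names as in the question
set.seed(1)
ex0331 <- data.frame(
  Iron       = abs(rnorm(36)) + 0.5,
  Supplement = factor(rep(c("Fe3", "Fe4"), each = 18))
)

# One row per Supplement level. Because FUN returns a named vector,
# the Iron column of the result is a matrix with columns mean, sd, len.
stats <- aggregate(Iron ~ Supplement, data = ex0331,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), len = length(x)))
stats
```

Individual statistics can then be read from the matrix column, for example stats$Iron[, "mean"].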

Update. To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:

> transform(ex0331, LogIron = log(Iron))

         Iron Supplement     LogIron
1  0.07185141        Fe3 -2.63315498
2  1.10367297        Fe3  0.09864368
3  0.48592428        Fe3 -0.72170246
4  0.20286918        Fe3 -1.59519393
5  0.80830682        Fe4 -0.21281357

Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:

> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
  Supplement    meanLog
1        Fe3 -1.0062304
2        Fe4  0.2791507

Answer 2

You can get a subset of your data by indexing or by using subset:

ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))

subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")

ex0331[ex0331$supplement=="Fe3",]

Or all at once with split, which returns a list:

split(ex0331,ex0331$supplement)
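
As a rough sketch of how the list returned by split can be used, each element is an ordinary data.frame, so any function can be applied to one group or to all groups. The names groups, fe3 and group_stats are illustrative only, and the data is the same kind of random stand-in used above.

```r
# Random stand-in data, as in the example above (values are not the real ex0331)
set.seed(42)
ex0331 <- data.frame(iron = rnorm(36), supplement = c("Fe3", "Fe4"))

# One data.frame per supplement level, stored in a named list
groups <- split(ex0331, ex0331$supplement)

# Work on a single group
fe3 <- groups[["Fe3"]]
c(n = nrow(fe3), mean = mean(fe3$iron), sd = sd(fe3$iron))

# Or compute the same statistics for every group at once;
# sapply returns a matrix with one column per group
group_stats <- sapply(groups, function(g) {
  c(n = nrow(g), mean = mean(g$iron), sd = sd(g$iron))
})
group_stats
```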

Another thing you can do is use tapply to split by a factor and then apply a function:

tapply(ex0331$iron,ex0331$supplement,mean)
        Fe3         Fe4 
-0.15443861 -0.01308835

The plyr package, which has loads of useful functions, can also be used. For example:

library(plyr)
# x is the data.frame for one supplement; take the mean of its iron column
daply(ex0331, .(supplement), function(x) mean(x$iron))
        Fe3         Fe4 
-0.15443861 -0.01308835

Edit

In response to the edited question, you could get the log of iron per supplement with:

ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))

tapply(ex0331$iron,ex0331$supplement,log)

Or with plyr:

library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))

Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
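
One possibly simpler base R alternative to the wrapper function is to take the log first and then split the resulting vector by group. This is a sketch, not the only way to do it; log_by_group is an illustrative name, and the data is again random stand-in data shifted to be strictly positive so that log() is defined.

```r
# Random, strictly positive stand-in data
set.seed(7)
ex0331 <- data.frame(iron = abs(rnorm(36)) + 0.1, supplement = c("Fe3", "Fe4"))

# Log-transform once, then split the vector by supplement
log_by_group <- with(ex0331, split(log(iron), supplement))

# Per-group mean of the logged values
sapply(log_by_group, mean)

# The same numbers via tapply, without building the list first
tapply(log(ex0331$iron), ex0331$supplement, mean)
```

Because log_by_group is a plain named list of numeric vectors, it can also be passed directly to boxplot(), which draws one box per list element.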

Answer 1 (3), 2011-02-19 18:34:26
Answer 2 (3, accepted), 2011-02-19 18:35:21