
How do I filter a data.frame in R by categorical variable?

I'm just learning R.

Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for usage?

str(ex0331)
'data.frame':   36 obs. of  2 variables:
$ Iron      : num  0.71 1.66 2.01 2.16 2.42 ...
$ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...

Basically, I need to be able to operate on the two factor levels separately; i.e., I need to determine the length, mean, sd, etc. of the Iron retention rate individually for each Supplement type (Fe3 or Fe4).

What's the easiest way to accomplish this?

I'm aware of the by() command. For example, the following gets some of what I need:

by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
     Iron       Supplement
Min.   :0.710   Fe3:18    
1st Qu.:2.420   Fe4: 0    
Median :3.475             
Mean   :3.699             
3rd Qu.:4.472             
Max.   :8.240             
------------------------------------------------------------ 
ex0331$Supplement: Fe4
     Iron        Supplement
Min.   : 2.200   Fe3: 0    
1st Qu.: 3.892   Fe4:18    
Median : 5.750             
Mean   : 5.937             
3rd Qu.: 6.970             
Max.   :12.450      

But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.

I'd recommend using the ddply function from the plyr package; detailed documentation is available online:

> require(plyr)
> ddply( ex0331, .(Supplement), summarise, 
         mean = mean(Iron), 
         sd = sd(Iron), 
         len = length(Iron))

  Supplement       mean        sd len
1        Fe3 -0.3749169 0.2827360   4
2        Fe4  0.1953116 0.7128129   6

Update: To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:

> transform(ex0331, LogIron = log(Iron))

         Iron Supplement     LogIron
1  0.07185141        Fe3 -2.63315498
2  1.10367297        Fe3  0.09864368
3  0.48592428        Fe3 -0.72170246
4  0.20286918        Fe3 -1.59519393
5  0.80830682        Fe4 -0.21281357

Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:

> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
  Supplement    meanLog
1        Fe3 -1.0062304
2        Fe4  0.2791507

You can get a subset of your data by indexing or by using subset():

ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))

subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")

ex0331[ex0331$supplement=="Fe3",]

Or get all groups at once with split(), which returns a list:

split(ex0331,ex0331$supplement)
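As a minimal sketch of what can be done with that list (the toy data frame and its values below are made up for illustration, not taken from ex0331), each element is an ordinary data.frame, so it can be pulled out by name or passed through sapply() to get several statistics per group:

```r
# Toy data with hypothetical values; column names follow the example above
toy <- data.frame(
  iron       = c(1, 2, 3, 4, 5, 6),
  supplement = factor(c("Fe3", "Fe4", "Fe3", "Fe4", "Fe3", "Fe4"))
)

# split() returns a named list with one data.frame per factor level
groups <- split(toy, toy$supplement)

# Each piece is a plain data.frame, so any function can be applied to it
fe3      <- groups$Fe3
fe3_mean <- mean(fe3$iron)

# Compute several statistics for every group in one pass;
# the result is a matrix with one column per level
stats <- sapply(groups, function(g) {
  c(n = length(g$iron), mean = mean(g$iron), sd = sd(g$iron))
})
print(stats)
```

Working on groups$Fe3 and groups$Fe4 directly is probably the most flexible option if you want to plot or transform each group on its own.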

Another thing you can do is use tapply() to split by a factor and then apply a function to each group:

tapply(ex0331$iron,ex0331$supplement,mean)
        Fe3         Fe4 
-0.15443861 -0.01308835 
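If you would rather get a data.frame back than a named vector, base R's aggregate() with a formula is one alternative. This is only a sketch on made-up toy values, not the original ex0331 data:

```r
# Toy data with hypothetical values; column names follow the example above
toy <- data.frame(
  iron       = c(1, 2, 3, 4, 5, 6),
  supplement = factor(c("Fe3", "Fe4", "Fe3", "Fe4", "Fe3", "Fe4"))
)

# Formula interface: summarise iron within each level of supplement
means <- aggregate(iron ~ supplement, data = toy, FUN = mean)
print(means)

# tapply() works the same way with any other summary function, e.g. sd
sds <- tapply(toy$iron, toy$supplement, sd)
print(sds)
```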

The plyr package, which has loads of useful functions, can also be used. For example:

library(plyr)
daply(ex0331, .(supplement), function(x) mean(x$iron))
        Fe3         Fe4 
-0.15443861 -0.01308835 

Edit

In response to the edited question, you could get the log of iron per supplement with:

ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))

tapply(ex0331$iron,ex0331$supplement,log)

Or with plyr:

library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))

Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
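One possibly simpler route, sketched here on made-up toy values: log() is applied element by element and does not depend on the group, so you can take the log of the whole column first and then split the result by the factor. That gives one numeric vector per level without a wrapper function:

```r
# Toy data with hypothetical positive values; column names follow the example above
toy <- data.frame(
  iron       = c(1, 2, 3, 4, 5, 6),
  supplement = factor(c("Fe3", "Fe4", "Fe3", "Fe4", "Fe3", "Fe4"))
)

# Take the log of the whole column, then split it by supplement
logs <- split(log(toy$iron), toy$supplement)
print(logs)

# Per-group summaries of the logged values follow the same pattern
mean_logs <- sapply(logs, mean)
print(mean_logs)
```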


 