How do you subset a data frame in R based on a minimum sample size?

Let's say you have a data frame with two factor columns that looks like this:

Factor1    Factor2    Value
A          1          0.75
A          1          0.34
A          2          1.21   
A          2          0.75 
A          2          0.53
B          1          0.42
B          2          0.21  
B          2          0.18
B          2          1.42

etc.

How do I subset this data frame ("df", if you will) so that it keeps only the combinations of Factor1 and Factor2 (Fact1*Fact2) that have more than, say, 2 observations? Can you use the length argument in subset to do this?

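library(data.table)

# convert the data frame (here called your_df) to a data.table
dt <- data.table(your_df)

# for each Factor1/Factor2 group, return its rows (.SD) only if the group has more than 2 rows (.N)
dt[, if (.N > 2) .SD, by = list(Factor1, Factor2)]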
#   Factor1 Factor2 Value
#1:       A       2  1.21
#2:       A       2  0.75
#3:       A       2  0.53
#4:       B       2  0.21
#5:       B       2  0.18
#6:       B       2  1.42
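
For readers without data.table, the same group-wise filter can be sketched in base R. This is an illustrative equivalent, not part of the original answer; the data frame name df and the threshold variable min_n are assumptions made for the example.

```r
# Sample data from the question (named df here for illustration)
df <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

min_n <- 2  # assumed threshold: keep groups with MORE than this many rows

# Split into one data frame per Factor1/Factor2 combination
groups <- split(df, list(df$Factor1, df$Factor2), drop = TRUE)

# Keep only the groups that are large enough, then stack them back together
kept <- Filter(function(g) nrow(g) > min_n, groups)
result <- do.call(rbind, kept)

nrow(result)
# [1] 6
```

Unlike the data.table call, this keeps the original column order and prefixes the row names with the group label.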

You can use interaction and table to see the number of observations for each combination (mydata is your data) and then use %in% to subset the data.

mydata$inter <- with(mydata, interaction(Factor1, Factor2))
table(mydata$inter)
# A.1 B.1 A.2 B.2 
#   2   1   3   3 

mydata[!mydata$inter %in% c("A.1", "B.1"), ]
#   Factor1 Factor2 Value inter
# 3       A       2  1.21   A.2
# 4       A       2  0.75   A.2
# 5       A       2  0.53   A.2
# 7       B       2  0.21   B.2
# 8       B       2  0.18   B.2
# 9       B       2  1.42   B.2

Updated as per @Ananda's comment: after creating the interaction variable, you can use the following one-liner, which avoids typing the combinations by hand.

mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
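
To see why the one-liner works, it can help to unpack it into steps. The sketch below rebuilds the question's sample data and computes the combinations to keep from the counts; the intermediate names counts and big_enough are made up for illustration.

```r
# Sample data from the question
mydata <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)
mydata$inter <- interaction(mydata$Factor1, mydata$Factor2)

counts <- table(mydata$inter)           # observations per combination
big_enough <- names(which(counts > 2))  # labels of combinations with more than 2 rows
big_enough
# [1] "A.2" "B.2"

# Keep only the rows whose combination label is in that set
result <- mydata[mydata$inter %in% big_enough, ]
```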

Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:

mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2, 
                           FUN = function(x) length(x) > 2))), ]
#   Factor1 Factor2 Value
# 3       A       2  1.21
# 4       A       2  0.75
# 5       A       2  0.53
# 7       B       2  0.21
# 8       B       2  0.18
# 9       B       2  1.42

Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"

The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE" "TRUE"  "TRUE"  "TRUE" 

Almost there... but since the first argument to ave was a character vector, our output is also a character vector. We wrap it in as.logical so we can use it directly for subsetting.


ave doesn't work this way when its first argument is of class factor, in which case you'll need to convert it to character first, something like:

mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2, 
                               FUN = function(x) length(x) > 2))),]
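
A possible way to avoid the character/factor conversion altogether, offered here as a sketch and not taken from the original answer, is to pass a numeric vector as the first argument of ave. The group sizes then come back as numbers and can be compared to the threshold directly. The name n_per_group is made up for illustration.

```r
# Sample data from the question; Factor1 is deliberately stored as a factor
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = TRUE
)

# Value is numeric, so ave returns numeric group sizes, one per row
n_per_group <- ave(mydf$Value, mydf$Factor1, mydf$Factor2, FUN = length)
n_per_group
# [1] 2 2 3 3 3 1 3 3 3

# Compare to the threshold and subset in one step
result <- mydf[n_per_group > 2, ]
```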
