How do you subset a data frame in R based on a minimum sample size?
Let's say you have a data frame with two factor columns that looks like this:
Factor1 Factor2 Value
A 1 0.75
A 1 0.34
A 2 1.21
A 2 0.75
A 2 0.53
B 1 0.42
B 2 0.21
B 2 0.18
B 2 1.42
etc.
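For anyone who wants to reproduce this, the rows shown above can be rebuilt roughly as follows. This is only a sketch: the name `df` comes from the question, the column types (character and numeric) are an assumption, and `counts` is a hypothetical helper name used to show how many observations each combination has.

```r
# Rebuild the nine sample rows shown in the question.
# Assumption: Factor1 is character and Factor2 is numeric.
df <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# Cross-tabulate the two columns to see the sample size of each combination.
counts <- table(df$Factor1, df$Factor2)
print(counts)
```

Only the A/2 and B/2 combinations have more than 2 observations in this sample.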
How do I subset this data frame ("df", if you will) based on the condition that the combination of Factor1 and Factor2 (Fact1*Fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?
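library(data.table)
dt = data.table(your_df)
# For each Factor1/Factor2 group, return its rows (.SD) only if the group has more than 2 rows (.N)
dt[, if (.N > 2) .SD, by = list(Factor1, Factor2)]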
# Factor1 Factor2 Value
#1: A 2 1.21
#2: A 2 0.75
#3: A 2 0.53
#4: B 2 0.21
#5: B 2 0.18
#6: B 2 1.42
You can use interaction and table to see the number of observations for each combination (mydata is your data), and then use %in% to subset the data.
mydata$inter<-with(mydata,interaction(Factor1,Factor2))
table(mydata$inter)
A.1 B.1 A.2 B.2
2 1 3 3
mydata[!mydata$inter %in% c("A.1","B.1"), ]
Factor1 Factor2 Value inter
3 A 2 1.21 A.2
4 A 2 0.75 A.2
5 A 2 0.53 A.2
7 B 2 0.21 B.2
8 B 2 0.18 B.2
9 B 2 1.42 B.2
Updated as per @Ananda's comment: after creating the interaction variable, you can use the following one-liner instead of typing out the level names by hand.
mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
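The same idea can be written without adding a helper column to the data and with the threshold pulled out into a variable. This is a minimal sketch rather than the answer's own code: `min_n`, `inter`, `counts`, `keep`, and `result` are hypothetical names, and the sample data is rebuilt here so the block runs on its own.

```r
# Sample data from the question (rebuilt so the block is self-contained).
mydata <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

min_n <- 2  # hypothetical threshold: keep combinations with more than min_n rows

# Label every row with its Factor1.Factor2 combination (kept outside the data frame).
inter <- with(mydata, interaction(Factor1, Factor2))

# Count rows per combination and collect the labels that pass the threshold.
counts <- table(inter)
keep   <- names(counts)[counts > min_n]

# Keep only the rows whose combination passed.
result <- mydata[inter %in% keep, ]
print(result)
```

Keeping `inter` as a separate vector means the result has the original three columns only.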
Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:
mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2,
FUN = function(x) length(x) > 2))), ]
# Factor1 Factor2 Value
# 3 A 2 1.21
# 4 A 2 0.75
# 5 A 2 0.53
# 7 B 2 0.21
# 8 B 2 0.18
# 9 B 2 1.42
Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"
The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE" "TRUE" "TRUE" "FALSE" "TRUE" "TRUE" "TRUE"
Almost there... but since the first argument was a character vector, our output is also a character vector. We wrap it in as.logical so we can use it directly for subsetting.
ave doesn't work on objects of class factor, so if Factor1 is a factor you'll need to do something like:
mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2,
FUN = function(x) length(x) > 2))),]
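One possible way to sidestep both the character coercion and the factor issue is to hand ave an integer vector as its first argument and use the factor columns only for grouping. This is a sketch of that variation, not part of the original answer: `group_n` and `result` are hypothetical names, and the data is rebuilt with factor columns as an assumption to show the case described above.

```r
# Sample data with Factor1 stored as a factor (assumption, to mimic the problem case).
mydf <- data.frame(
  Factor1 = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B")),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# The first argument is a plain integer vector, so the per-group lengths
# come back as integers; Factor1 and Factor2 are used only as grouping variables.
group_n <- with(mydf, ave(seq_along(Factor1), Factor1, Factor2, FUN = length))

# Compare against the threshold outside ave to get a logical index directly.
result <- mydf[group_n > 2, ]
print(result)
```

Because the comparison happens after ave returns, no as.logical or as.character conversion is needed here.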