Subset by column criteria AND randomly sample rows of a data.table
@gented's answer here demonstrates how to randomly select a subset of rows from a data.table.
What if I wanted to select all rows in a data.table for which the values in a certain column meet a specific condition, AND ADDITIONALLY select a random subset of rows from the same data.table for which the values in that column meet a different condition?
Say, for example, that I wanted a random sample of 5 rows from the mtcars data.table for which cyl == 6, and all rows for which cyl == 8.
Is this achievable in a better way than:
rbind(
mtcars[ cyl == 8 ],
mtcars[ cyl == 6 ][ sample(.N, 5) ]
)
That is, can I subset the data.table in a single set of []'s, so that I could also, for example, apply a function within that call (in the lapply(.SD, function) format)?
This obviously does not achieve the desired result, but is similar to the syntax I'm looking for:
mtcars[
cyl == 8 | ( cyl == 6 & sample( .N, 5 ) ),
lapply(.SD, generic_function),
.SDcols = (specific_cols)
]
As long as i ends up with something that can be used to select rows, you can put any valid expression there, which technically means you can write:
DT[c(sample(which(cyl == 6), 5L), which(cyl == 8))]
But that probably won't benefit from data.table's internal optimizations.
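To make the one-liner above concrete, here is a minimal sketch of building the row numbers first and then using them in i together with a function in j. It assumes DT is simply mtcars converted with as.data.table; the seed and the columns passed to .SDcols are arbitrary choices for illustration, not part of the original answer.

```r
library(data.table)

# Assumption: DT is mtcars converted to a data.table
DT <- as.data.table(mtcars)

set.seed(1)  # arbitrary seed, only so the sample is reproducible

# Row numbers: 5 random rows with cyl == 6, plus every row with cyl == 8
rows <- DT[, c(sample(which(cyl == 6), 5L), which(cyl == 8))]

# Use the row numbers in i and apply a function over chosen columns in j
res <- DT[rows, lapply(.SD, mean), by = cyl, .SDcols = c("mpg", "hp")]
print(res)
```

Keeping the row numbers in a separate variable also makes it easy to check how many rows were selected before running the computation.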
Based on this answer (and secondary indices), I would think something like this would be a lot faster:
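# Sample n of the given values when the condition holds, otherwise keep them all
sample_if <- function(condition, values, n) {
  if (condition)
    sample(values, n)
  else
    values
}
# Placeholder for whatever function should be applied to the subset
some_fun <- function(.SD) {
  .SD
}
DT[DT[.(c(6, 8)), sample_if(.BY$cyl == 6, .I, 5L), by = "cyl", on = "cyl"]$V1,
   some_fun(.SD),
   .SDcols = c("cyl", "mpg")]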
cyl mpg
1: 6 19.7
2: 6 19.2
3: 6 21.4
4: 6 21.0
5: 6 18.1
6: 8 18.7
7: 8 14.3
8: 8 16.4
9: 8 17.3
10: 8 15.2
11: 8 10.4
12: 8 10.4
13: 8 14.7
14: 8 15.5
15: 8 15.2
16: 8 13.3
17: 8 19.2
18: 8 15.8
19: 8 15.0
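As a variation on the grouped .I idea above, the following sketch collects the row numbers in one grouped call without a helper function. It is an illustration under assumptions, not the answerer's method: DT is taken to be as.data.table(mtcars), the seed is arbitrary, and it relies on data.table dropping groups whose j expression returns NULL (here, cyl == 4).

```r
library(data.table)

# Assumption: DT is mtcars converted to a data.table
DT <- as.data.table(mtcars)

set.seed(1)  # arbitrary seed, only so the sample is reproducible

# Per group: 5 random row numbers for cyl == 6, all row numbers for cyl == 8,
# and NULL (group dropped) for any other cyl value
idx <- DT[, if (.BY$cyl == 6) .I[sample.int(.N, 5L)] else if (.BY$cyl == 8) .I,
          by = cyl]$V1

# Use the collected row numbers in i and compute on selected columns in j
out <- DT[idx, lapply(.SD, mean), by = cyl, .SDcols = c("mpg", "hp")]
print(out)
```

Indexing .I with sample.int(.N, 5L) rather than calling sample(.I, 5L) sidesteps the documented behaviour of sample(), which treats a single number x as the range 1:x when a group happens to contain only one row.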
To achieve that, I would utilize the .I special symbol as follows:
library(data.table)
DT <- as.data.table(mtcars)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]
Now you can do some computations:
set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
, lapply(.SD, mean)
, by = am
, .SDcols = 3:5]
which gives:
   am   disp       hp     drat
1:  0 325.64 179.0667 3.224667
2:  1 243.00 204.7500 3.890000
If you want to reuse that index vector at a later moment, you can store it beforehand:
idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
DT[idx, lapply(.SD, mean), .SDcols = 3:5]
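A short sketch of why storing the index can be useful: the same sampled rows can be checked once and then reused across several computations. The seed matches the one used earlier in this answer; the sanity check and the second summary (max) are additions for illustration only.

```r
library(data.table)

DT <- as.data.table(mtcars)

set.seed(2019)
idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))

# Sanity check: how many rows of each cyl value ended up in the index
counts <- DT[idx, .N, by = cyl]
print(counts)

# Reuse the same rows for more than one summary
means <- DT[idx, lapply(.SD, mean), .SDcols = 3:5]
maxes <- DT[idx, lapply(.SD, max), .SDcols = 3:5]
```

Because idx is fixed once it is created, both summaries describe exactly the same 19 rows, which would not be guaranteed if sample() were called again inside each DT[...] call.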