Subset by column condition and randomly sample rows of a data.table

Question

@gented's answer here demonstrates how to randomly select a subset of rows from a data.table.

What if I wanted to select all rows in a data.table for which the values in a certain column meet a specific condition, and additionally select a random subset of the rows for which the values in that same column meet a different condition?

Say, for example, that I wanted a random sample of 5 rows from mtcars (as a data.table) for which cyl == 6, plus all rows for which cyl == 8.

Is this achievable in a better way than:

rbind(
    mtcars[ cyl == 8 ],
    mtcars[ cyl == 6 ][ sample(.N, 5) ]
)

That is, can I subset the data.table in a single set of []'s, so that I could also, for example, apply a function within that call (in the lapply(.SD, function) format)?

This obviously does not achieve the desired result, but it is similar to the syntax I'm looking for:

mtcars[ 
    cyl == 8 | ( cyl == 6 & sample( .N, 5 ) ), 
    lapply(.SD, generic_function),
    .SDcols = (specific_cols) 
]

Answer 1

As long as i ends up with something that can be used to select rows, you can put any valid expression there, which technically means you can write:

DT[c(sample(which(cyl == 6), 5L), which(cyl == 8))]

But that probably won't benefit from optimizations.
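For readers who want to run the one-liner above, here is a minimal self-contained sketch. It assumes DT is mtcars converted with as.data.table(), and the seed value is arbitrary, chosen only so the draw is reproducible. The row positions are built outside the [] call purely for readability; the columns are referenced as DT$cyl for that reason.

```r
library(data.table)

# mtcars ships as a data.frame, so convert it first (assumed setup)
DT <- as.data.table(mtcars)

set.seed(42)  # arbitrary seed, only for reproducibility

# 5 random row positions where cyl == 6, plus every position where cyl == 8
rows <- c(sample(which(DT$cyl == 6), 5L), which(DT$cyl == 8))
res <- DT[rows]

# mtcars has 7 six-cylinder and 14 eight-cylinder cars,
# so this shows 5 rows for cyl == 6 and 14 rows for cyl == 8
res[, .N, by = cyl]
```

One thing to watch with this pattern: sample(x, n) treats a single number x as 1:x, so if only one row matched the sampled condition the draw would come from the wrong set of positions.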

Based on this answer (and secondary indices), I would think something like this would be a lot faster:

sample_if <- function(condition, values, n) {
  if (condition)
    sample(values, n)
  else
    values
}

some_fun <- function(.SD) {
  .SD
}

DT[DT[.(c(6, 8)), sample_if(.BY$cyl == 6, .I, 5L), by = "cyl", on = "cyl"]$V1,
   some_fun(.SD),
   .SDcols = c("cyl", "mpg")]
    cyl  mpg
 1:   6 19.7
 2:   6 19.2
 3:   6 21.4
 4:   6 21.0
 5:   6 18.1
 6:   8 18.7
 7:   8 14.3
 8:   8 16.4
 9:   8 17.3
10:   8 15.2
11:   8 10.4
12:   8 10.4
13:   8 14.7
14:   8 15.5
15:   8 15.2
16:   8 13.3
17:   8 19.2
18:   8 15.8
19:   8 15.0

Answer 2

To achieve that, I would use the .I special symbol as follows:

DT <- as.data.table(mtcars)

DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]

Now you can do some computations:

set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
   , lapply(.SD, mean)
   , by = am
   , .SDcols = 3:5]

which gives:

   am   disp       hp     drat
1:  0 325.64 179.0667 3.224667
2:  1 243.00 204.7500 3.890000

If you want to reuse that index vector later, you can store it beforehand:

idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))

DT[idx, lapply(.SD, mean), .SDcols = 3:5]
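If the same pattern is needed for several conditions, the stored-index idea can be wrapped in a small function. The sketch below is only an illustration: mixed_idx and its argument names are hypothetical and not part of data.table, and DT is again assumed to be mtcars converted with as.data.table().

```r
library(data.table)

DT <- as.data.table(mtcars)  # assumed setup, as in the answer above

# Hypothetical helper: positions of all rows where `keep_all` is TRUE,
# followed by up to `n` randomly chosen positions where `sample_from` is TRUE.
mixed_idx <- function(keep_all, sample_from, n) {
  pool <- which(sample_from)
  # Draw positions within `pool` instead of calling sample(pool, n):
  # sample() treats a single number x as 1:x, which would pick wrong rows
  # when only one row matches.
  picked <- pool[sample.int(length(pool), min(n, length(pool)))]
  c(which(keep_all), picked)
}

set.seed(2019)
idx <- mixed_idx(DT$cyl == 8, DT$cyl == 6, 5L)

# The index can then be reused in any later call
DT[idx, lapply(.SD, mean), by = am, .SDcols = 3:5]
```

Capping the draw at the pool size is a design choice made here so that the helper does not error when fewer than n rows match; drop the min() if an error is preferable in that case.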

Answer 1: 2019-07-04 19:06:20
Answer 2 (accepted): 2019-07-04 20:48:59