Subset by column criteria AND randomly sample rows of a data.table

@gented's answer here demonstrates how to randomly select a subset of rows from a data.table.

What if I wanted to select all rows in a data.table for which the values in a certain column meet a specific condition, AND ADDITIONALLY select a random subset of rows from the data.table for which the values in the same column meet a different condition?

Say, for example, that I wanted a random sample of 5 rows from the mtcars data.table for which cyl == 6, and all rows for which cyl == 8.

Is this achievable in a better way than:

rbind(
    mtcars[ cyl == 8 ],
    mtcars[ cyl == 6 ][ sample(.N, 5) ]
)

That is, can I subset the data.table in a single set of []'s so that I could also, for example, apply a function within that call (in the lapply(.SD, function) format)?

This obviously does not achieve the desired result, but is similar to the syntax I'm looking for:

mtcars[
    cyl == 8 | ( cyl == 6 & sample( .N, 5 ) ),
    lapply(.SD, generic_function),
    .SDcols = (specific_cols)
]

As long as i ends up with something that can be used to select rows, you can put any valid expression there, which technically means you can write:

DT[c(sample(which(cyl == 6), 5L), which(cyl == 8))]

But that probably won't benefit from optimizations.
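For reference, here is a minimal, self-contained sketch of how that kind of row-number vector combines with j and .SDcols. It assumes DT is simply as.data.table(mtcars); the seed and the choice of the mpg and hp columns are arbitrary and only for illustration.

```r
library(data.table)

DT <- as.data.table(mtcars)

set.seed(1)  # arbitrary seed, only so the sample is reproducible

# Row numbers: 5 random rows with cyl == 6, plus every row with cyl == 8
rows <- c(sample(which(DT$cyl == 6), 5L), which(DT$cyl == 8))

# i is a plain integer vector, so j, by and .SDcols work as usual
res <- DT[rows, lapply(.SD, mean), by = cyl, .SDcols = c("mpg", "hp")]

# Number of selected rows per cylinder group
counts <- DT[rows, .N, by = cyl]
```

Because the selection is an ordinary integer vector, everything after it in the call behaves the same as for any other subset.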

Based on this answer (and secondary indices), I would think something like this would be a lot faster:

sample_if <- function(condition, values, n) {
  if (condition)
    sample(values, n)
  else
    values
}

some_fun <- function(.SD) {
  .SD
}

DT[DT[.(c(6, 8)), sample_if(.BY$cyl == 6, .I, 5L), by = "cyl", on = "cyl"]$V1,
   some_fun(.SD),
   .SDcols = c("cyl", "mpg")]
    cyl  mpg
 1:   6 19.7
 2:   6 19.2
 3:   6 21.4
 4:   6 21.0
 5:   6 18.1
 6:   8 18.7
 7:   8 14.3
 8:   8 16.4
 9:   8 17.3
10:   8 15.2
11:   8 10.4
12:   8 10.4
13:   8 14.7
14:   8 15.5
15:   8 15.2
16:   8 13.3
17:   8 19.2
18:   8 15.8
19:   8 15.0
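The secondary indices mentioned above can also be set explicitly. The following is a minimal sketch, assuming DT is as.data.table(mtcars); setindex() and indices() are data.table functions. Whether the index gives a measurable speed-up on a table this small is not something the example demonstrates.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Build a secondary index on cyl; unlike setkey(), this does not reorder DT
setindex(DT, cyl)
indices(DT)  # lists the index columns currently attached to DT

# Subsets written with on = can make use of the index
eights <- DT[.(8), on = "cyl"]
```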

To achieve that, I would utilize the .I special symbol as follows:

DT <- as.data.table(mtcars)

DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]

Now you can do some computations:

set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
   , lapply(.SD, mean)
   , by = am
   , .SDcols = 3:5]

which gives:

   am   disp       hp     drat
1:  0 325.64 179.0667 3.224667
2:  1 243.00 204.7500 3.890000

If you want to reuse that index vector at a later moment, you can store it beforehand:

idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))

DT[idx, lapply(.SD, mean), .SDcols = 3:5]
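One caveat when building such an index vector: base R's sample(x, n) treats a length-one numeric x as "sample from 1:x", so if only a single row matched the condition, the wrong rows could be drawn. The sketch below guards against that. safe_sample is a hypothetical helper written for this illustration, not part of data.table, and capping n at the number of available rows is an assumption about the desired behavior.

```r
library(data.table)

# Hypothetical helper: draw up to n elements from x by position,
# avoiding sample()'s special case for a length-one numeric x.
safe_sample <- function(x, n) {
  x[sample.int(length(x), min(n, length(x)))]
}

DT <- as.data.table(mtcars)

set.seed(2019)  # any seed works; it only makes the draw reproducible

# All rows with cyl == 8, plus (up to) 5 random rows with cyl == 6
idx <- c(DT[, .I[cyl == 8]], safe_sample(DT[, .I[cyl == 6]], 5))

out <- DT[idx, lapply(.SD, mean), .SDcols = 3:5]
```

If a shortfall should raise an error rather than be silently capped, the min() call can be dropped so that sample.int() complains when n exceeds the number of matching rows.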
