Incorporating splitstackshape into a loop

Question

I have the following code that draws 1000 samples of 4 rows from iris, repeats that 100 times, and calculates the bias of each column.

library(SimDesign)
library(data.table)

do.call(rbind,lapply(1:100, function(x) {
  bias(
    setDT(copy(iris))[as.vector(sapply(1:1000, function(X) sample(1:nrow(iris),4)))][
      , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
    parameter=c(5,3,2,1), #parameter is the true population value used to calculate bias
    type='relative' #denotes the type of bias being calculated 
  )
}))

This takes 1000 samples of 4 rows and calculates the mean by sample number, giving me 1000 means. The bias of those 1000 means is found for each column, and the whole thing is then done 99 more times, giving me a distribution of bias estimates for each column. This mimics a random sampling design. However, I also want to do this for a stratified design, so I use splitstackshape's stratified function.

do.call(rbind,lapply(1:100, function(x) {
  bias(
    setDT(copy(iris))[as.vector(sapply(1:1000, function(X) stratified(iris,group="Species", size=1)))][
      , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
    parameter=c(5,3,2,1), 
    type='relative'
  )
}))

I would've thought that it is just a matter of swapping out the functions, but I keep getting this error: "i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT." I think it might be something related to setDT, but I'm not really sure how to fix it. Anybody know where I'm going wrong?

Answer 1

I've split this into a couple of functions for you.

Load data.table, SimDesign, and splitstackshape:

library(SimDesign)
library(data.table)
library(splitstackshape)

Function to get `n` stratified samples of size `sampsize` and return the column means of those samples:

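get_samples <- function(n, sampsize=4) {
  rbindlist(lapply(1:n, function(x) {
    # one stratified draw: sampsize rows per Species, tagged with the draw number
    splitstackshape::stratified(iris, group="Species",sampsize)[, id:=x]
  }))[, lapply(.SD, mean), by=.(Species, id)]  # column means per Species and draw
}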

Now, let's get the distribution of bias across `y` such iterations of these samples:

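get_bias_distribution <- function(y=100, samples_per_iter=50, size_per_iter=4) {
  rbindlist(lapply(1:y, function(iter_id) {
    # column means per (Species, sample); the sample id is not needed afterwards
    samples = get_samples(samples_per_iter, sampsize=size_per_iter)[, id:=NULL]
    # relative bias of the sample means within each Species, scaled to percent
    samples[, as.list(bias(
      estimate=.SD,parameter=c(5,3,2,1),type="relative")*100),
      by=.(Species)][, iter:=iter_id]
  }))
}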

Usage (using defaults):

get_bias_distribution()

Output:

        Species Sepal.Length Sepal.Width Petal.Length Petal.Width iter
  1:     setosa    -1.236667    22.61833    -26.70000   -39.69667    1
  2: versicolor    46.476667   -11.99500    115.12833    16.82167    1
  3:  virginica    80.596667    -0.20000    180.21833    53.89000    1
  4:     setosa    -1.513333    20.87000    -27.46167   -38.83667    2
  5: versicolor    45.333333   -11.34833    112.84833    17.84500    2
 ---                                                                  
296: versicolor    48.250000   -12.26833    113.37000    17.71167   99
297:  virginica    77.366667    -2.87000    175.60000    53.07167   99
298:     setosa    -1.005000    22.67500    -27.02833   -39.69500  100
299: versicolor    47.921667   -10.28333    110.97833    16.86833  100
300:  virginica    76.153333    -2.44000    174.46167    52.62167  100
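
To help read these numbers, here is a minimal base-R sketch of what I understand a percent relative bias to be: the mean of the estimates minus the reference value, divided by the reference value, times 100. It does not call SimDesign, and it uses the full iris column means as a stand-in for a set of sample means, so treat `est` and `pct_rel_bias` as illustrative names only.

```r
# Stand-in for "the mean of many sample means": the column means of all of iris
est <- colMeans(iris[, 1:4])

# Same reference values as the `parameter` argument in the question
parameter <- c(5, 3, 2, 1)

# Relative bias per column, scaled to percent (assumed to mirror type = "relative" * 100)
pct_rel_bias <- (est - parameter) / parameter * 100
round(pct_rel_bias, 2)
```

Because the same four reference values are applied to every Species, large values for versicolor and virginica mostly reflect how far those species sit from c(5,3,2,1), not a problem with the sampling.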

Some comments on what was going wrong above

When you call stratified(iris,group="Species", size=1), you get a 3-row data.table, because you are effectively selecting one row at random from each of the three Species:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1:          4.9         3.6          1.4         0.1     setosa
2:          6.3         2.5          4.9         1.5 versicolor
3:          7.7         2.8          6.7         2.0  virginica

When you wrap this in sapply(1:1000, function(x)...), you get a 5 x 1000 matrix in which each column holds 5 vectors of length 3 (one per column of the sampled table). Below is what this looks like for sapply(1:6, function(x)...):

             [,1]      [,2]      [,3]      [,4]      [,5]      [,6]     
Sepal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Sepal.Width  numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Width  numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Species      factor,3  factor,3  factor,3  factor,3  factor,3  factor,3

This is not really what you want, because you cannot then lapply over these samples the way you intended. What you want to do instead is use lapply(1:1000, function(x)...) to create a list of such 3-row data.tables, and then bind them together (after adding an id column to each one).
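
A small sketch of that difference, using only data.table. The helper `draw_one_per_species` is a hypothetical stand-in for stratified(iris, group="Species", size=1), not part of splitstackshape, and `idcol` is used here as one way of adding the id column while binding.

```r
library(data.table)

set.seed(1)  # only so the sketch is reproducible
iris_dt <- as.data.table(iris)

# Hypothetical stand-in for stratified(): one random row per Species
draw_one_per_species <- function(dt) dt[, .SD[sample(.N, 1)], by = Species]

# sapply() simplifies the six 3-row tables into a 5 x 6 list-matrix,
# which cannot be used as a row index
as_matrix <- sapply(1:6, function(x) draw_one_per_species(iris_dt))
dim(as_matrix)
is.list(as_matrix)

# lapply() keeps each draw as its own table; rbindlist() stacks them
# and idcol records which draw each row came from
as_list <- lapply(1:6, function(x) draw_one_per_species(iris_dt))
stacked <- rbindlist(as_list, idcol = "id")
dim(stacked)
```

The stacked table has one row per Species per draw, so grouping by `id` (and `Species`, if needed) gives per-sample means in the usual data.table way.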

Answer 1 (2, accepted) 2022-02-24 01:17:24
