How to create Stratified Sampling for multiple columns in R

My data set has 821049 observations (rows) and 18 columns. I would like to use 9 of the columns for the stratified sampling. These are "BASKETS_NZ", "PIS", "PIS_AP", "PIS_DV", "PIS_PL", "PIS_SDV", "PIS_SHOPS", "PIS_SR", "QUANTITY". My stratification variable is ID = 1:821049. How do I choose the intervals for my variables? How do I set the size of the sample?

dput(rbind(head(WKA_ohneJB, 10), tail(WKA_ohneJB, 10)))

structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
821039L, 821040L, 821041L, 821042L, 821043L, 821044L, 821045L, 
821046L, 821047L, 821048L), BASKETS_NZ = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
LOGONS = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), PIS = c(71L, 39L, 50L, 4L, 
13L, 4L, 30L, 65L, 13L, 31L, 111L, 33L, 3L, 46L, 11L, 8L, 
17L, 68L, 65L, 15L), PIS_AP = c(14L, 2L, 4L, 0L, 0L, 0L, 
1L, 0L, 2L, 1L, 13L, 0L, 0L, 2L, 1L, 0L, 3L, 8L, 0L, 1L), 
PIS_DV = c(3L, 19L, 4L, 1L, 0L, 0L, 6L, 2L, 2L, 3L, 38L, 
8L, 0L, 5L, 2L, 0L, 1L, 0L, 3L, 2L), PIS_PL = c(0L, 5L, 8L, 
2L, 0L, 0L, 0L, 24L, 0L, 6L, 32L, 8L, 0L, 0L, 4L, 0L, 0L, 
0L, 0L, 0L), PIS_SDV = c(18L, 0L, 11L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 6L, 0L, 0L, 13L, 0L, 0L, 1L, 15L, 1L, 0L), PIS_SHOPS = c(3L, 
24L, 13L, 3L, 0L, 0L, 6L, 28L, 2L, 11L, 71L, 16L, 2L, 5L, 
6L, 0L, 1L, 0L, 3L, 2L), PIS_SR = c(19L, 0L, 14L, 0L, 0L, 
0L, 2L, 23L, 0L, 3L, 6L, 0L, 0L, 20L, 0L, 0L, 3L, 32L, 1L, 
0L), QUANTITY = c(13L, 2L, 18L, 1L, 14L, 1L, 4L, 2L, 5L, 
1L, 5L, 2L, 2L, 4L, 1L, 3L, 2L, 8L, 17L, 8L), WKA = c(1L, 
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
0L, 0L, 1L, 1L), NEW_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), EXIST_CUST = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), WEB_CUST = c(1L, 0L, 0L, 0L, 1L, 1L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), MOBILE_CUST = c(0L, 
1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 1L, 0L), TABLET_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L), 
LOGON_CUST_STEP2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 821039L, 821040L, 821041L, 
821042L, 821043L, 821044L, 821045L, 821046L, 821047L, 821048L
), class = "data.frame")


Here is a solution to perform stratified sampling based on multiple columns. Before implementing it, consider that your data is continuous and sufficiently large, so a simple random sample may well be adequate.
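As a baseline, a simple random sample needs nothing beyond base R. The sketch below uses a small made-up data frame (`toy`, with Poisson-distributed stand-ins for `PIS` and `QUANTITY`), since the real `WKA_ohneJB` data is not available; the sample size of 100 is an arbitrary choice for illustration.

```r
# Hypothetical stand-in for the real data (WKA_ohneJB is not available here).
set.seed(42)
toy <- data.frame(
  ID = 1:1000,
  PIS = rpois(1000, lambda = 30),
  QUANTITY = rpois(1000, lambda = 5)
)

# The sample size is simply the number of row indices drawn.
sample_size <- 100
srs <- toy[sample(nrow(toy), size = sample_size), ]

# Compare the sample against the full data on the columns of interest.
rbind(full   = colMeans(toy[c("PIS", "QUANTITY")]),
      sample = colMeans(srs[c("PIS", "QUANTITY")]))
```

If the sample means (or other summaries you care about) track the full data closely, stratification may add little.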

One way to solve this problem is to take a sample from each group. The data can be grouped either by pasting the 9 columns together into a single key or by using dplyr's group_by function.
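The first option, pasting the columns together, could look roughly like this in base R. This is only a sketch on made-up data: `toy`, `stratum`, `frac` and the two grouping columns `y` and `z` are illustrative names, not part of the original data set.

```r
set.seed(1)
toy <- data.frame(
  x = rnorm(100),
  y = sample(c("A", "B"), 100, replace = TRUE),
  z = sample(c("D", "E", "F"), 100, replace = TRUE)
)

# Build one stratum key per row by pasting the grouping columns together.
toy$stratum <- paste(toy$y, toy$z, sep = "_")

# Draw the same fraction of row indices within each stratum.
frac <- 0.8
idx_by_stratum <- split(seq_len(nrow(toy)), toy$stratum)
picked <- unlist(lapply(idx_by_stratum, function(idx) {
  # sample.int() on positions avoids the sample(x) surprise when length(idx) == 1
  idx[sample.int(length(idx), size = floor(frac * length(idx)))]
}), use.names = FALSE)

pasted_sample <- toy[picked, ]
table(pasted_sample$stratum)
```

The pasted key is easy to inspect with `table()`, which makes it simple to spot strata that are too small before sampling.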

The function below takes the solution from the question How to get around error "factor has new levels" in cross-validation glm? and updates it to dplyr style.

The dplyr_stratified function takes the desired sampling ratio and an arbitrary number of columns, and returns a data frame with the sampled rows. See the example below, which stratifies on 2 columns.

# Example data: one numeric column and two categorical columns
set.seed(1)
x <- rnorm(n = 100)
y <- rep(x = c("A", "B"), times = c(50, 50))
z <- rep(x = c("D", "E", "F"), times = c(33, 33, 34))
data <- data.frame(x, y = sample(y, replace = TRUE), z = sample(z, replace = TRUE))

library(dplyr)
# Optional: tag each row for later identification
data$rowid <- seq_len(nrow(data))

dplyr_stratified <- function(df, percent, ...) {
  columns <- enquos(...)
  # Group by the given columns, then sample a fraction of each group.
  # The result is returned visibly (and is still grouped).
  df %>%
    group_by(!!!columns) %>%
    slice(sample(seq_len(n()), floor(percent * n())))
}

testgroup <- dplyr_stratified(data, 0.8, z, y)
testgroup
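With dplyr 1.0.0 or later, the same per-group sampling can probably be written more compactly with `slice_sample()`. The sketch below assumes that version of dplyr is installed and uses its own made-up data frame (`toy`) so it can run on its own.

```r
library(dplyr)

set.seed(1)
toy <- data.frame(
  x = rnorm(100),
  y = sample(c("A", "B"), 100, replace = TRUE),
  z = sample(c("D", "E", "F"), 100, replace = TRUE)
)

# Sample 80% of the rows within each (z, y) group, then drop the grouping.
sliced <- toy %>%
  group_by(z, y) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()

# Rows kept per group.
count(sliced, z, y)
```

Use `n =` instead of `prop =` to draw a fixed number of rows per group rather than a fraction.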

Note: this assumes each group contains enough rows to yield a representative sample. (If the groups are too small, this approach may not meet expectations.)
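For count-like columns such as PIS or QUANTITY, grouping on the raw values would likely produce many tiny groups. One possible way to choose intervals is to bin each column at its quantiles and then check the resulting group sizes. The sketch below is an assumption-laden illustration: `quantile_bin` is a hypothetical helper, the data is simulated, and 4 bins per column is an arbitrary starting point. A column that is constant (as BASKETS_NZ is in the excerpt shown) cannot be binned this way and would have to be left out.

```r
set.seed(7)
toy <- data.frame(
  PIS = rpois(2000, lambda = 30),
  QUANTITY = rpois(2000, lambda = 5)
)

# Hypothetical helper: cut a numeric column into roughly equal-sized bins
# using its quantiles as break points. unique() guards against tied breaks.
quantile_bin <- function(v, n_bins = 4) {
  breaks <- unique(quantile(v, probs = seq(0, 1, length.out = n_bins + 1)))
  cut(v, breaks = breaks, include.lowest = TRUE)
}

toy$PIS_bin <- quantile_bin(toy$PIS)
toy$QUANTITY_bin <- quantile_bin(toy$QUANTITY)

# Number of rows in each combined stratum; tiny cells signal too many bins.
stratum_sizes <- table(toy$PIS_bin, toy$QUANTITY_bin)
stratum_sizes
min(stratum_sizes)
```

The binned columns can then serve as the grouping columns. With 9 columns the number of combinations grows quickly, so fewer bins per column (or fewer columns) may be needed to keep every stratum reasonably large.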

