R - stratified sampling for Person Period file

Following up on this question, I wondered how I can efficiently draw a stratified sample from a person-period file.

I have a data set that looks like this:

    id time var  clust
 1:  1    1   a clust1
 2:  1    2   c clust1
 3:  1    3   c clust1
 4:  2    1   a clust1
 5:  2    2   a clust1
...

Individuals (id) are grouped into clusters (clust). What I would like is to sample id by clust while keeping the person-period format, i.e. keeping all rows of each sampled id.

The solution I came up with is to sample id and then merge back. However, it is not a very elegant solution.

library(data.table) 
library(dplyr) 

setDT(dt) 

dt[,.SD[sample(.N,1)],by = clust] %>% 
  merge(., dt, by = 'id')

which gives

   id clust.x time.x var.x time.y var.y clust.y
1:  2  clust1      1     a      1     a  clust1
2:  2  clust1      1     a      2     a  clust1
3:  2  clust1      1     a      3     c  clust1
4:  3  clust2      3     c      1     a  clust2
5:  3  clust2      3     c      2     b  clust2
6:  3  clust2      3     c      3     c  clust2
7:  5  clust3      1     a      1     a  clust3
8:  5  clust3      1     a      2     a  clust3
9:  5  clust3      1     a      3     c  clust3

Is there a more straightforward solution?

library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2", 
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
 3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L, 
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L, 
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names =  c("id", 
 "time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))

Here is a variant following @Frank's comment that might help. Essentially, you can sample one unique id from each clust group and use .I to find the corresponding row numbers for subsetting:

dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]

#   id time var  clust
#1:  2    1   a clust1
#2:  2    2   a clust1
#3:  2    3   c clust1
#4:  3    1   a clust2
#5:  3    2   b clust2
#6:  3    3   c clust2
#7:  4    1   a clust3
#8:  4    2   b clust3
#9:  4    3   c clust3
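One thing to watch with the one-liner above: base R's sample(x, 1) treats a single positive number x as 1:x, so if id were numeric and a cluster held only one id, the draw could return an id that is not in that cluster. The ids in the question are factors, so it is not affected. Below is a minimal sketch of the same .I idea that samples a position instead of a value; the toy table and the names idx and res are made up for illustration.

```r
library(data.table)

# Toy person-period data with integer ids (an assumption; the question uses factors)
dt <- data.table(
  id    = rep(1:6, each = 3),
  time  = rep(1:3, times = 6),
  clust = rep(c("clust1", "clust1", "clust2", "clust3", "clust3", "clust2"), each = 3)
)

set.seed(1)  # only to make the draw repeatable

# Per cluster: pick one id by position, then return the row numbers (.I)
# of all rows in the original table that belong to that id
idx <- dt[, {
  u <- unique(id)
  .I[id == u[sample.int(length(u), 1L)]]
}, by = clust]$V1

# Subset the original table by those row numbers
res <- dt[idx]
```

res then holds every period of exactly one id per cluster, in the original long format.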

I think tidy data here would have an ID table where cluster is an attribute:

idDT = unique(dt[, .(id, clust)])


   id  clust
1:  1 clust1
2:  2 clust1
3:  3 clust2
4:  4 clust3
5:  5 clust3
6:  6 clust2

From there, sample...

my_selection = idDT[, .(id = sample(id, 1)), by=clust]

and merge or subset:

dt[ my_selection, on=names(my_selection) ]
# or 
dt[ id %in% my_selection$id ]

I would keep the intermediate table my_selection around, expecting it to come in handy later.
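The ID-table approach also extends to drawing more than one id per cluster. The following is only a sketch under assumptions: the toy data, the per-cluster sample size k, and the name sampled are hypothetical and not part of the original answer.

```r
library(data.table)

# Toy person-period data: 9 ids, 3 periods each, clusters of 4, 3 and 2 ids
dt <- data.table(
  id    = rep(1:9, each = 3),
  time  = rep(1:3, times = 9),
  clust = rep(c("clust1", "clust2", "clust3"), times = c(12, 9, 6))
)

# One row per id, with the cluster as an attribute
idDT <- unique(dt[, .(id, clust)])

k <- 2L       # hypothetical number of ids to draw per cluster
set.seed(42)  # only to make the draw repeatable

# Sample up to k rows of the ID table within each cluster
my_selection <- idDT[, .SD[sample.int(.N, min(k, .N))], by = clust]

# Join back to recover all periods of the sampled ids
sampled <- dt[my_selection, on = c("id", "clust")]
```

Sampling row positions with sample.int(.N, ...) avoids the single-number behaviour of sample(), and min(k, .N) keeps the call from failing when a cluster has fewer than k ids.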
