R - stratified sampling for Person Period file
Following up on this question, I wondered how I can effectively draw a stratified sample from a Person Period file.
I have a data set that looks like this:
id time var clust
1: 1 1 a clust1
2: 1 2 c clust1
3: 1 3 c clust1
4: 2 1 a clust1
5: 2 2 a clust1
...
Individuals (id) are grouped into clusters (clust). What I would like is to sample id by clust, keeping the person-period format, i.e. keeping every time row of each sampled id.
The solution I came up with is to sample one row per clust and then merge back on id. However, it is not a very elegant solution, since the merge duplicates the other columns as .x and .y.
library(data.table)
library(dplyr)
setDT(dt)
dt[,.SD[sample(.N,1)],by = clust] %>%
merge(., dt, by = 'id')
which gives
id clust.x time.x var.x time.y var.y clust.y
1: 2 clust1 1 a 1 a clust1
2: 2 clust1 1 a 2 a clust1
3: 2 clust1 1 a 3 c clust1
4: 3 clust2 3 c 1 a clust2
5: 3 clust2 3 c 2 b clust2
6: 3 clust2 3 c 3 c clust2
7: 5 clust3 1 a 1 a clust3
8: 5 clust3 1 a 2 a clust3
9: 5 clust3 1 a 3 c clust3
Is there a more straightforward solution?
library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2",
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L,
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names = c("id",
"time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))
Here is a variant following @Frank's comment that might help. Essentially, you can sample one unique id from each clust group and use .I to find the corresponding row numbers for subsetting:
dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]
# id time var clust
#1: 2 1 a clust1
#2: 2 2 a clust1
#3: 2 3 c clust1
#4: 3 1 a clust2
#5: 3 2 b clust2
#6: 3 3 c clust2
#7: 4 1 a clust3
#8: 4 2 b clust3
#9: 4 3 c clust3
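As a hedged extension of the .I idea above, the sketch below draws k ids per cluster instead of one. The toy table, the integer ids, and the name k are assumptions made for illustration, not part of the original answer. It samples positions with sample.int() rather than values, because sample(x, 1) on a single number x >= 1 draws from 1:x, which could bite if a cluster contains only one numeric id.

```r
library(data.table)

# Toy person-period data shaped like the question's (integer ids are an assumption)
dt <- data.table(
  id    = rep(1:6, each = 3),
  time  = rep(1:3, times = 6),
  clust = rep(c("clust1", "clust1", "clust2", "clust3", "clust3", "clust2"), each = 3)
)

set.seed(42)  # only to make the sketch reproducible
k <- 1        # hypothetical number of ids to draw per cluster

# Within each cluster: pick ids by position, then return the row numbers (.I)
# of every person-period row belonging to the picked ids
idx <- dt[, {
  u <- unique(id)
  picked <- u[sample.int(length(u), min(k, length(u)))]
  .I[id %in% picked]
}, by = clust]$V1

res <- dt[idx]
```

With k = 1 this should behave like the one-liner above; min(k, length(u)) simply caps the draw when a cluster has fewer than k ids.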
I think tidy data here would have an ID table where cluster is an attribute:
idDT = unique(dt[, .(id, clust)])
id clust
1: 1 clust1
2: 2 clust1
3: 3 clust2
4: 4 clust3
5: 5 clust3
6: 6 clust2
From there, sample...
my_selection = idDT[, .(id = sample(id, 1)), by=clust]
and merge or subset
dt[ my_selection, on=names(my_selection) ]
# or
dt[ id %in% my_selection$id ]
I would keep the intermediate table my_selection around, expecting it to come in handy later.
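The same ID-table approach can be sketched for more than one id per cluster. This is only an illustration under stated assumptions: the toy data (nine integer ids, three per cluster) and the name n_per_clust are made up here and do not come from the thread.

```r
library(data.table)

# Toy person-period data: 9 ids, 3 time points each, 3 ids per cluster (assumed)
dt <- data.table(
  id    = rep(1:9, each = 3),
  time  = rep(1:3, times = 9),
  clust = rep(c("clust1", "clust2", "clust3"), each = 9)
)

# One row per individual, with the cluster as an attribute
idDT <- unique(dt[, .(id, clust)])

set.seed(1)       # only to make the sketch reproducible
n_per_clust <- 2  # hypothetical number of ids to draw per cluster

# Sample rows of the ID table within each cluster, capped at the cluster size
my_selection <- idDT[, .SD[sample(.N, min(n_per_clust, .N))], by = clust]

# Join back to recover every person-period row of the selected ids
sampled <- dt[my_selection, on = .(id, clust)]
```

Because the sampling happens on idDT, each individual has the same chance of selection regardless of how many time rows it contributes, and my_selection stays available for later checks.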