
Randomly ordering across groups (not within group) in data.table

Let's say I want to order the iris dataset (as a data.table) by Species, keeping observations grouped by species and randomly ordering across species.

How do I do that?

I am not talking about generating a random order within groups (species).

My intuition was to write the code below, but it actually creates a random variable within species. Well, at least it makes the question reproducible.

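library(data.table)
d <- as.data.table(iris)
set.seed(12345)
# One draw per row within each species: a within-species random variable
d[, g := runif(.N), by = Species]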

You may do a binary search in i. A smaller example:

d <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
set.seed(1)
d[.(sample(unique(Species))), on = "Species"]
#    Species ri
# 1:       b  3
# 2:       b  4
# 3:       d  7
# 4:       d  8
# 5:       c  5
# 6:       c  6
# 7:       a  1
# 8:       a  2
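The same join can be tried on the iris data from the question. This is only a sketch: `row_id` is a hypothetical helper column added here to make the original row order visible, and the shuffled order you get depends on the seed and the R version.

```r
library(data.table)

d <- as.data.table(iris)
d[, row_id := .I]                      # hypothetical column: original row position

set.seed(1)
shuffled <- sample(unique(d$Species))  # the three species in random order
res <- d[.(shuffled), on = "Species"]  # rows come back in the order of `shuffled`

# Each species should form one contiguous run
rle(as.character(res$Species))$values
```

Because the join looks up each shuffled species in turn, rows of the same species stay together and keep their original relative order.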

We can randomly sample from the series 1...N, where N is the number of levels of the factor (Species) in question.

We then map the new order to a column and sort by it. Broken apart into steps for illustration, it looks like this:

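# Random permutation of 1..N, where N is the number of Species levels
tmp      <- sample(nlevels(d$Species))
# Map each row's level code to its shuffled position (Species must be a factor)
d$index  <- tmp[as.integer(d$Species)]
# Sort by the shuffled position
d        <- d[order(d$index),]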

You could compact this into one line/step:

d <- d[order(sample(nlevels(d$Species))[as.integer(d$Species)]),]
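Indexing by `as.numeric(d$Species)` relies on Species being a factor. For a character grouping column, `match()` against the unique values can play the same role. The sketch below wraps that in a helper; the name `shuffle_groups` and its arguments are made up for illustration and are not part of data.table.

```r
library(data.table)

# Hypothetical helper: reorder groups at random, keeping each group's rows together
shuffle_groups <- function(dt, col) {
  groups <- unique(dt[[col]])             # distinct group labels
  pos    <- sample(length(groups))        # random position for each group
  idx    <- order(pos[match(dt[[col]], groups)])  # stable sort by group position
  dt[idx]
}

set.seed(42)
x   <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
out <- shuffle_groups(x, "Species")
out
```

Since `order()` breaks ties by original position, rows within a group keep their relative order.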

Alternatively, you could do:

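# One row per species, each with a single random key
e <- d[, .N, by = Species]
e[, g2 := runif(.N)]
# Join the key back onto every row of d, then sort by it
d <- e[, .(Species, g2)][d, on = 'Species']
d <- d[order(g2)]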
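The idea of one random key per species can also be written without the helper table, as a small change to the attempt in the question: drawing `runif(1)` instead of `runif(.N)` gives a single value per group, which `:=` recycles across that group's rows. A minimal sketch, assuming ties between the draws are not a concern:

```r
library(data.table)

d <- as.data.table(iris)
set.seed(12345)

d[, g := runif(1), by = Species]  # one draw per species, recycled to all its rows
d <- d[order(g)]                  # sort species by their random key

unique(d$Species)                 # species in their new random order
```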
