Sample from a data.table

Question

I have a data.table from which I want to select a random subset, but only for some operations.

Suppose the data is

dat <- data.table(id=1:100, group=sample(1:20,100, replace=TRUE), a=runif(100), b=rnorm(100))

and I want to do two things:

count the number of ids per group
select one id at random from each group and record its values of a and b

I could follow "How do you extract a few random rows from a data.table on the fly" and choose

dat[, .(n=.N, a=a[sample(.N,1)], b=b[sample(.N,1)]), group]

but I am afraid this will select a and b independently of one another. Is there a way to select both from the same row?
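The concern can be checked directly: each sample(.N, 1) call draws its own index, so the two columns can come from different rows. Below is a minimal sketch on a made-up table (toy, not the data above) in which b is an exact copy of a, so any mismatch between the sampled values shows they were taken from different rows. Drawing the index once per group and reusing it is one way to keep them paired; the names toy, indep, paired and i are illustrative only.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility

# Toy table (hypothetical): 50 groups of 10 rows, b is an exact copy of a
toy <- data.table(group = rep(1:50, each = 10), a = runif(500))
toy[, b := a]

# Two separate sample() calls per group: a and b are drawn independently
indep <- toy[, list(n = .N, a = a[sample(.N, 1)], b = b[sample(.N, 1)]), by = group]

# One index per group, reused for both columns: a and b come from the same row
paired <- toy[, {
  i <- sample(.N, 1)
  list(n = .N, a = a[i], b = b[i])
}, by = group]

mean(indep$a == indep$b)   # share of groups where the two draws happened to agree
all(paired$a == paired$b)  # TRUE, because both values use the same index
```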

Answer 1

Part 1

If you want to count the number of unique ids and some ids repeat within groups

dat[, .(n_ids = uniqueN(id)), group]

If ids don't repeat within groups, or you don't want to count them on a unique basis

dat[, .(n_ids = .N), group]
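The difference between the two counts only shows up when an id repeats inside a group. A small sketch on a hypothetical table (toy is not part of the original data) where id 1 appears twice in group "x":

```r
library(data.table)

# Toy table (hypothetical): id 1 occurs on two rows of group "x"
toy <- data.table(group = c("x", "x", "x", "y"), id = c(1, 1, 2, 3))

# uniqueN() counts distinct ids, .N counts rows
res <- toy[, .(n_ids = uniqueN(id), n_rows = .N), by = group]
res
# group "x": n_ids = 2, n_rows = 3
# group "y": n_ids = 1, n_rows = 1
```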

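Part 2

If ids repeat within groups and you want to return all rows for the randomly selected id in each group

dat[dat[, .(id = id[sample(.N, 1)]), group], on = .(id, group)]

Indexing with sample(.N, 1) instead of calling sample(id, 1) avoids a pitfall of sample(): when x is a single number, sample(x, 1) draws from 1:x, so a group with only one row could return an id that is not in that group.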
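To make the join step concrete, here is a minimal sketch on a hypothetical table (toy, picked and res are illustrative names) in which every id occupies two rows of its group. The inner call picks one id per group, and the join on id and group brings back every row of that id. The pick is written as id[sample(.N, 1)], an assumption of this sketch rather than part of the answer, so that a group with a single row cannot trigger sample()'s 1:x behaviour for a length-one numeric vector.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility

# Toy table (hypothetical): each id appears on two rows within its group
toy <- data.table(
  group = rep(c("x", "y"), each = 4),
  id    = c(1, 1, 2, 2, 3, 3, 4, 4),
  a     = 1:8
)

# Step 1: one randomly chosen id per group
picked <- toy[, .(id = id[sample(.N, 1)]), by = group]

# Step 2: join back to keep all rows belonging to the chosen id
res <- toy[picked, on = .(id, group)]
res  # two rows per group, both with the same id
```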

If ids do not repeat, or you only want one row per group anyway

dat[dat[, .I[sample(.N, 1)], group]$V1]

Here .I[sample(.N, 1)] is used rather than sample(.I, 1) because sample() treats a single number x as 1:x, which would pick the wrong row for a group that has only one row.

Thanks to Frank's comment, you can also do the second option for parts 1 & 2 above in one line. This returns the same kind of row as dat[dat[, .I[sample(.N, 1)], group]$V1] but also adds a column N showing the number of ids (assumed to equal the number of rows in the group)

dat[sample(.N), c(.SD[1], .N), keyby=group]
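The one-liner can be read as two steps: sample(.N) in i shuffles all rows, and .SD[1] then takes the first row of each group in that shuffled order, which amounts to a random row per group. The sketch below splits the steps apart and checks the result against the original table. It names the count column explicitly with list(N = .N), which is a choice made for this sketch; shuffled, res and counts are illustrative names.

```r
library(data.table)

set.seed(7)  # arbitrary seed, only for reproducibility
dat <- data.table(id = 1:100, group = sample(1:20, 100, replace = TRUE),
                  a = runif(100), b = rnorm(100))

# Step 1: put all rows in random order
shuffled <- dat[sample(.N)]

# Step 2: per group, keep the first shuffled row and append the group size
res <- shuffled[, c(.SD[1], list(N = .N)), keyby = group]

# Checks: id equals the row number in dat, so a and b can be looked up directly
all(dat$a[res$id] == res$a & dat$b[res$id] == res$b)  # a and b come from the same row
counts <- dat[, .N, keyby = group]
all(res$N == counts$N)                                # N matches the group sizes
```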

Answer 1: 7, accepted, 2019-06-25 19:18:07