Group observations into a specified number of groups according to id, with a data.table solution
I have the following data.table:
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]
dt
    id        obs
 1:  1  0.1470735
 2:  1  1.6954685
 3:  1  2.3947260
 4:  1  2.1782338
 5:  1  0.5168873
 6:  2 -0.8879545
 7:  2  1.9320034
 8:  2  2.6269272
 9:  2  1.5212627
10:  2 -0.1581711
It has 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to randomly group the ids into groups of X ids each and create a new column holding the group. For this example, let's say I want to end up with a data.table like this:
    id        obs group
 1:  1  0.1470735     A
 2:  1  1.6954685     A
 3:  1  2.3947260     A
 4:  1  2.1782338     A
 5:  1  0.5168873     A
 6:  2 -0.8879545     A
 7:  2  1.9320034     A
 8:  2  2.6269272     A
 9:  2  1.5212627     A
10:  2 -0.1581711     A
Here ids 1 and 2 are assigned to group A, ids 3 and 4 to group B, and id 5 to group C.
My actual dataset is much larger and will not necessarily divide evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the groups (for example, I want to be able to say 5 ids per group, and if the last group has only 3 ids, that's fine).
Could someone please help me with an elegant data.table way to accomplish this?
This is the same as @Shree's answer, just using length.out in rep and no dplyr.
"I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine)."
You can make an id table, assign groups there, and if necessary merge back:
# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)
# make an id table
idDT = unique(DT[, "id"])
# randomly assign groups
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]
# merge back if needed
DT[idDT, on=.(id), g := i.g]
You refer to "my actual dataset", but R allows you to juggle multiple tables. Trying to do everything in one table is almost always counterproductive.
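As a minimal sketch, the same id-table approach can be applied to the 5-id example from the question, with at most 2 ids per group and letter labels. The names ids_per_group, group, and sizes, and the fixed seed, are assumptions made for illustration. LETTERS only covers 26 groups, so a larger dataset would need integer group numbers as in the code above.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the random assignment is repeatable
dt <- data.table(id = rep(1:5, each = 5), obs = rnorm(25, mean = 1))

ids_per_group <- 2  # hypothetical setting: at most 2 ids per group

# one row per id
idDT <- unique(dt[, "id"])

# group numbers 1,1,2,2,3 (truncated to the number of ids), shuffled across ids,
# then mapped to letters
idDT[, group := LETTERS[sample(rep(seq_len(.N), each = ids_per_group, length.out = .N))]]

# update join: copy each id's group onto every observation of that id
dt[idDT, on = .(id), group := i.group]

# number of distinct ids per group; none should exceed ids_per_group
sizes <- dt[, .(n_ids = uniqueN(id)), by = group]
print(sizes)
```

Because the group numbers are shuffled, which ids end up together is random, while the group sizes stay fixed: every group gets ids_per_group ids except possibly the last.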
EDIT: Didn't notice that you needed this with data.table. I'll leave this here as an alternative.
I am creating a data frame with each id and a randomly assigned group. This is then joined with your data by id to get the group for each record:
library(dplyr)
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]
max_per_group <- 5
n_ids <- length(unique(dt$id))
data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
left_join(dt, ., by = "id")
   id         obs grp
1   1  1.28879713   S
2   1  1.04471197   S
3   1  0.36470847   S
4   1  0.46741567   S
5   1  1.07749891   S
6   2  1.73640785   K
7   2  1.61144042   K
8   2  2.85196859   K
9   2  1.84848117   K
10  2  2.11395863   K
11  3  0.88623462   S
12  3  2.11706351   S
13  3  1.29225433   S
14  3  0.30458037   S
15  3 -1.72070005   S
16  4  2.24593162   U
17  4  2.10346287   U
18  4  2.28724412   U
19  4  0.02978044   U
20  4  0.56234660   U
21  5  2.92050008   F
22  5  1.08048974   F
23  5  0.58885261   F
24  5  1.53299092   F
25  5  1.47271123   F
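A small base R sketch of why this caps the group size: the pool holds each letter max_per_group times and sample() draws from it without replacement, so no letter can be handed out more than max_per_group times. The names pool and counts, the seed, and the choice of 12 ids are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only for repeatability

max_per_group <- 5
n_ids <- 12  # hypothetical number of distinct ids

# each of the 26 letters appears max_per_group times: 130 labels in total
pool <- rep(LETTERS, max_per_group)

# drawn without replacement, so each letter is used at most max_per_group times
grp <- sample(pool, n_ids)

counts <- table(grp)
print(counts)
print(length(pool))
```

Two consequences seem worth keeping in mind. The pool has only 26 * max_per_group labels, so sample() raises an error once there are more ids than that. Groups are also not filled up to max_per_group: with few ids, most letters are drawn only once or twice, as in the output above where 5 ids land in 4 groups.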