Group observations into specified number of groups according to id with data.table solution

I have the following data.table:

library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]
dt

   id      obs
1:  1  0.1470735
2:  1  1.6954685
3:  1  2.3947260
4:  1  2.1782338
5:  1  0.5168873
6:  2 -0.8879545
7:  2  1.9320034
8:  2  2.6269272
9:  2  1.5212627
10: 2 -0.1581711

It has 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to randomly group the ids into groups of X ids and create a new column holding the group. For this example, let's say I want to end up with a data.table like this:

   id      obs      group
1:  1  0.1470735      A
2:  1  1.6954685      A
3:  1  2.3947260      A
4:  1  2.1782338      A
5:  1  0.5168873      A
6:  2 -0.8879545      A
7:  2  1.9320034      A
8:  2  2.6269272      A
9:  2  1.5212627      A
10: 2 -0.1581711      A

Here ids 1 and 2 are assigned to group A, ids 3 and 4 are assigned to group B, and id 5 is assigned to group C.

My actual dataset is much larger and will not necessarily divide evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the group (for example, I want to be able to say 5 ids per group, and if the last group has only 3 ids that's fine).

Could someone please help me with an elegant data.table way to accomplish this?

This is the same as @Shree's answer, just using length.out in rep and no dplyr.

"I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine)."

You can make an id table, assign groups there, and if necessary merge back:

# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)

# make an id table (one row per id)
idDT = unique(DT[, "id"])

# randomly assign groups: group numbers repeated max_per_group times, then shuffled
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]

# merge back if needed (update join adds g to DT by reference)
DT[idDT, on=.(id), g := i.g]
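
To tie this back to the small example in the question, here is a minimal sketch of the same id-table approach with 2 ids per group and letter labels. The names ids_per_group, id_dt and sizes are made up for illustration, the seed is only there for reproducibility, and LETTERS is an assumption that limits the labels to 26 groups.

```r
library(data.table)

set.seed(42)         # assumption: fixed seed only so the sketch is reproducible
ids_per_group <- 2   # hypothetical name for the "X" in the question

dt <- data.table(id = rep(1:5, each = 5), obs = rnorm(n = 25, mean = 1))

# one row per id
id_dt <- unique(dt[, "id"])

# group numbers 1, 1, 2, 2, 3 shuffled across the ids
id_dt[, g := sample(rep(seq_len(.N), each = ids_per_group, length.out = .N))]

# optional letter labels; LETTERS only covers up to 26 groups
id_dt[, group := LETTERS[g]]

# copy the label onto every observation via an update join
dt[id_dt, on = .(id), group := i.group]

# number of ids in each group
sizes <- id_dt[, .N, by = group]
```

With 5 ids and 2 ids per group, sizes should show two groups of 2 ids and one group of 1 id; which ids land in which group depends on the shuffle.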

You refer to "my actual dataset", but R allows you to juggle multiple tables. Trying to do everything in one table is almost always counterproductive.
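
As one possible illustration of working with two tables, the sketch below keeps the group assignment in the id table and joins it only when a per-group summary is needed. The data, the group size of 3, and the names per_group, n_obs and mean_obs are assumptions for this example, not part of the question.

```r
library(data.table)

set.seed(1)  # assumption: seed only for reproducibility of the sketch
DT <- data.table(id = rep(1:7, each = 3), obs = rnorm(21))

# separate id-level table holding the group assignment (3 ids per group here)
idDT <- unique(DT[, "id"])
idDT[, g := sample(rep(seq_len(.N), each = 3, length.out = .N))]

# join on demand: attach g to each observation, then summarise per group
per_group <- idDT[DT, on = .(id)][, .(n_obs = .N, mean_obs = mean(obs)), by = g]
```

DT itself is left untouched here; the grouping lives only in idDT until a join asks for it.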

EDIT: Didn't notice that you needed this with data.table. I'll leave this here as an alternative.

I am creating a data frame with id and a randomly assigned group. This is then joined with your data to get the group for each record by id:

library(dplyr)
library(data.table)

dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]

max_per_group <- 5
n_ids <- length(unique(dt$id))

data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
  left_join(dt, ., by = "id")

   id         obs grp
1   1  1.28879713   S
2   1  1.04471197   S
3   1  0.36470847   S
4   1  0.46741567   S
5   1  1.07749891   S
6   2  1.73640785   K
7   2  1.61144042   K
8   2  2.85196859   K
9   2  1.84848117   K
10  2  2.11395863   K
11  3  0.88623462   S
12  3  2.11706351   S
13  3  1.29225433   S
14  3  0.30458037   S
15  3 -1.72070005   S
16  4  2.24593162   U
17  4  2.10346287   U
18  4  2.28724412   U
19  4  0.02978044   U
20  4  0.56234660   U
21  5  2.92050008   F
22  5  1.08048974   F
23  5  0.58885261   F
24  5  1.53299092   F
25  5  1.47271123   F
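
A note on how this sampling behaves, as a rough sketch: the labels are drawn without replacement from a pool in which each letter appears max_per_group times, so no label can be given to more than max_per_group ids, but groups are not filled up to that size (in the output above, ids 1 and 3 happen to share S while the others are alone). The value n_ids = 100 and the names pool and ids_per_label are assumptions for illustration.

```r
set.seed(7)          # assumption: seed only for reproducibility of the sketch
max_per_group <- 5
n_ids <- 100         # hypothetical number of ids

pool <- rep(LETTERS, max_per_group)   # 26 * 5 = 130 labels
grp <- sample(pool, n_ids)            # drawn without replacement

ids_per_label <- table(grp)
max(ids_per_label)   # never above max_per_group
length(pool)         # upper bound on n_ids for this pool
```

Because the draw is without replacement, sample() raises an error once n_ids exceeds length(pool), so this variant only works for up to 26 * max_per_group ids unless a larger set of labels is supplied.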
