Split a data.table into roughly equal parts

Question

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N is the length of the data

k is the number of distinct values of id

M is the number of desired parts

The idea is that M << k << N, so splitting by id is no good.

library(data.table)
library(dplyr)

set.seed(1)
N <- 16 # in application N is very large
k <- 6  # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
      arrange(id)
t(dt$id)

#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a"  "b"  "b"  "b"  "b"  "c"  "c"  "c"  "d"  "d"   "d"   "e"   "e"   "f"   "f"   "f"

In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}} and for M=4 is {{a,b}, {c}, {d,e}, {f}}

More generally, if id were numeric, the cutoff points should be quantile(id, probs=seq(0, 1, length.out = M+1), type=1) or some similar split into roughly equal parts.

What is an efficient way to do this?

Answer 1

If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:

split(dt, as.numeric(as.factor(dt$id)) %% M)

It assigns each id to a bucket using the id's integer factor code modulo the number of buckets.

For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach but most likely less practical.
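To make the random-level-order idea concrete, here is a minimal sketch. The toy data mirrors the group sizes of the question's example, and the names shuffled_levels and bucket are mine, not part of the answer above.

```r
library(data.table)

set.seed(1)
M <- 3
# Toy data with the same group sizes as the question's example (a:1, b:4, c:3, d:3, e:2, f:3)
dt <- data.table(id = rep(letters[1:6], c(1, 4, 3, 3, 2, 3)), value = runif(16))

# Shuffle the level order so bucket numbers do not follow the sort order of id
shuffled_levels <- sample(unique(dt$id))
bucket <- as.integer(factor(dt$id, levels = shuffled_levels)) %% M

# One data.table per bucket; all rows of an id land in the same bucket
parts <- split(dt, bucket)
sapply(parts, nrow)
```

This balances the number of ids per bucket, not the number of rows, so the parts are only as even as the id sizes happen to be.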

Answer 2

Preliminary comment

I recommend reading what the main author of data.table has to say about parallelization with it.

I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below...

Instead of literally splitting up the data - create a new "parallel.id" column, and then call
 dt[, parallel_operation(.SD), by = parallel.id] 
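One possible way to build such a column is sketched below. It is an assumption on my part rather than something from @eddi's comment: each id goes to the part that contains the midpoint of its rows in cumulative order. Here parallel_operation is a hypothetical placeholder for the real work, and the toy data reuses the group sizes from the question.

```r
library(data.table)

set.seed(1)
M <- 3
dt <- data.table(id = rep(letters[1:6], c(1, 4, 3, 3, 2, 3)), value = runif(16))

# One row per id with its row count, in id order
sizes <- dt[, .N, by = id][order(id)]
# Assign each id to the part that contains the midpoint of its rows
sizes[, parallel.id := ceiling((cumsum(N) - N / 2) / sum(N) * M)]
# Copy the part number onto every row of dt (update join)
dt[sizes, parallel.id := i.parallel.id, on = "id"]

# Hypothetical stand-in for the real per-part computation
parallel_operation <- function(sd) list(rows = nrow(sd), total = sum(sd$value))
res <- dt[, parallel_operation(.SD), by = parallel.id]
res$rows
# [1] 5 6 5
```

On this toy data the parts come out as {a,b}, {c,d}, {e,f}, the split the question asks for with M=3. The by call only expresses the grouping; running the parts in parallel still depends on how you dispatch them.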

Answer, assuming you don't want to use by

Sort the IDs by size:

ids   <- names(sort(table(dt$id)))
n     <- length(ids)

Rearrange so that we alternate between small and big IDs, following Arun's interleaving trick:

alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
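To see what the interleaving does, here is the same expression on a made-up vector. The labels s1 to s6 are hypothetical ids, ordered from smallest to largest group.

```r
# Hypothetical ids, already sorted from smallest to largest group
ids <- c("s1", "s2", "s3", "s4", "s5", "s6")
n <- length(ids)

# Stack the vector on top of its reverse, then reorder by position so that
# element i of ids is followed directly by element i of rev(ids)
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# [1] "s1" "s6" "s2" "s5" "s3" "s4"
```

The smallest and largest IDs end up next to each other, so cutting the vector into consecutive chunks pairs small groups with large ones.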

Split the IDs in order, with roughly the same number of IDs in each group (like zero323's answer):

gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))

res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])] 
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]

Check that the sizes aren't too bad:

# using the OP's example data...

sapply(res, nrow)
# [1] 7 9              for M = 2
# [1] 5 5 6            for M = 3
# [1] 1 6 3 6          for M = 4
# [1] 1 4 2 3 6        for M = 5

Although I emphasized data.table at the top, this should work fine with a data.frame, too.

Answer 3

If k is big enough, you can use this idea to split the data into groups:

First, let's find the size of each id

group_sizes <- dt[, .N, by = id]

Then create 2 lists of length M, one tracking the size of each group and one tracking which ids it contains

grps_vals <- list()
grps_vals[1 : M] <- c(0)

grps_nms <- list()
grps_nms[1 : M] <- c(0)

(Here I deliberately added zero values to be able to create lists of size M)

Then loop over the ids, on every iteration adding the current id to the smallest group. This makes the groups roughly equal

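for (i in 1:nrow(group_sizes)) {
   sums <- sapply(grps_vals, sum)       # current total size of each group
   idx <- which(sums == min(sums))[1]   # first of the currently smallest groups
   grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
   grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])
}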

Finally, delete the placeholder zero, which is the first element, from each list of names :)

grps_nms <- lapply(grps_nms, function(x){x[-1]})

> grps_nms
[[1]]
[1] "a" "d" "f"

[[2]]
[1] "b"

[[3]]
[1] "c" "e"
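A variation on the same greedy idea, sketched under one added assumption of mine: the ids are visited from largest to smallest, which tends to even out the parts a bit more. The names load and assigned are hypothetical, and the toy data reuses the group sizes from the question.

```r
library(data.table)

M <- 3
dt <- data.table(id = rep(letters[1:6], c(1, 4, 3, 3, 2, 3)))

# Size of each id, largest first (the ordering is the added assumption)
group_sizes <- dt[, .N, by = id][order(-N)]

load <- numeric(M)                       # running row count of each part
assigned <- integer(nrow(group_sizes))   # part chosen for each id

for (i in seq_len(nrow(group_sizes))) {
  idx <- which.min(load)                 # first of the currently smallest parts
  assigned[i] <- idx
  load[idx] <- load[idx] + group_sizes$N[i]
}
group_sizes[, part := assigned]
load
# [1] 5 6 5
```

Keeping the running totals in a plain numeric vector avoids the placeholder zeros and the final cleanup step.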

Answer 4

Just an alternative approach using dplyr. Run the chained script step by step to see how the dataset changes at each step. It is a simple process.

library(data.table)
library(dplyr)

set.seed(1)
N <- 16 # in application N is very large
k <- 6  # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
      arrange(id)

dt %>% 
  select(id) %>%
  distinct() %>%                   # select distinct id values
  mutate(group = ntile(id,3)) %>%  # create grouping 
  inner_join(dt, by="id")          # join back initial information

PS: I've learnt lots of useful stuff from the previous answers.

Answer 1: 5, 2015-08-20 19:16:10
Answer 2: 4, accepted, 2015-08-20 19:53:18
Answer 3: 1, 2015-08-20 20:47:57
Answer 4: 1, 2015-08-20 21:15:53