R data.table efficient replication by group

I am running into memory allocation problems when trying to replicate data by group using data.table and rep.

Here is some sample data:

ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")

What I am trying to do is replicate the tokens (in their present order) by the multiple for each year.

The following code works and gives me the answer I want:
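To make the goal concrete, here is a toy illustration in base R (the vector `tok` is made up for this sketch and is not part of the sample data). The whole token sequence should be repeated as a block, which is what `rep(x, times = )` does, rather than each token being repeated in place, which is what `rep(x, each = )` does.

```r
tok <- c("THE", "BLACK", "DOG")   # hypothetical tokens for one year

# wanted: the whole sequence repeated, original order preserved
wanted <- rep(tok, times = 2)
wanted
# "THE" "BLACK" "DOG" "THE" "BLACK" "DOG"

# not wanted: each token repeated in place
not_wanted <- rep(tok, each = 2)
not_wanted
# "THE" "THE" "BLACK" "BLACK" "DOG" "DOG"
```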

library(plyr)        # for ddply
library(data.table)

good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))

good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]

The issue is that when I scale this up to 40 million+ rows, I run into memory issues with both solutions.

If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.

Does anyone have a better solution?

I looked at set() for data.table but ran into issues because I wanted to keep the tokens in the same order for each replication.

One way is:

require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]

Everything except the last line should be straightforward. The last line subsets on the key column with the help of J(.). Each value in J(.) is matched against the key column, and the matched subset is returned.

That is, if you do dt[J(1)] you'll get the subset where multiple == 1. And if you look carefully, dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter matches the value 1, twice, against the first key column of the data.table.
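The difference can be seen on a small keyed table. This is only a sketch: the tables `toy` and `toy2` and their values are made up for illustration.

```r
library(data.table)

# toy table keyed on a single column
toy <- data.table(multiple = c(1, 2, 2), token = c("A", "B", "C"))
setkey(toy, multiple)

once  <- toy[J(1)]           # rows where multiple == 1
twice <- toy[J(rep(1, 2))]   # the same rows, once per value passed to J(.)

nrow(once)    # 1
nrow(twice)   # 2

# toy table keyed on two columns: J(1, 1) matches k1 == 1 AND k2 == 1
toy2 <- data.table(k1 = c(1, 1), k2 = c(1, 2), token = c("A", "B"))
setkey(toy2, k1, k2)
both <- toy2[J(1, 1)]
nrow(both)    # 1
```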

So, if we pass the same value of the column twice in J(.), the matching subset is returned twice. We use this trick to pass 1 once, 2 twice, etc., and that's what the rep(.) part does. Here rep(.) gives 1,2,2,3,3,3,4,4,4,4.

And if the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector that's inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform this join (I guess this is a new feature from data.table 1.8.8).
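As a quick check of the trick, the sketch below runs the same join on a small made-up table (`dt2` and its values are assumptions for illustration) and confirms that each group comes back as many times as its multiple, with the token order preserved.

```r
library(data.table)

idx <- c(1, 2, 3, 4)   # the unique multiples in the sample data
rep(idx, idx)
# 1 2 2 3 3 3 4 4 4 4

# hypothetical table: 3 tokens with multiple 2, and 2 tokens with multiple 3
dt2 <- data.table(yr = c(1, 1, 1, 2, 2),
                  token = c("A", "B", "C", "D", "E"),
                  multiple = c(2, 2, 2, 3, 3))
setkey(dt2, multiple)

u <- unique(dt2$multiple)
out <- dt2[J(rep(u, u)), allow.cartesian = TRUE]

# rows expected: 3 tokens * 2 + 2 tokens * 3 = 12
nrow(out)
out$token
# "A" "B" "C" "A" "B" "C" "D" "E" "D" "E" "D" "E"
```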


Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocations in either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it escapes me at the moment. I'll write back again. For now, here's some test data and some preliminary results in case anyone is interested or wants to run it for themselves.

# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)

DF <- data.frame(yr = rep(yr, sz), 
                 token = token, 
                 multiple = multiple, stringsAsFactors=FALSE)

# Arun's solution
ARUN.DT <- function(dt) {
    setkey(dt, "multiple")
    idx <- unique(dt$multiple)
    dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}

# Ricardo's solution
RICARDO.DT <- function(dt) {
    setkey(dt, key="yr")
    newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
    newDT[, tokenReps := as.character(NA)]

    # Add the rep'd tokens into newDT, using recycling
    newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
    newDT
}

# create data.table
require(data.table)
DT <- data.table(DF)

# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")

#                     test replications elapsed relative user.self sys.self
# 1    res1 <- ARUN.DT(DT)           10   9.542    1.000     7.218    1.394
# 2 res2 <- RICARDO.DT(DT)           10  17.484    1.832    14.270    2.888

But as Ricardo says, speed may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used by both methods, to say definitively whether using a join is better.
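One rough way to compare peak memory, offered only as a sketch and not as what was used for the numbers above, is to reset the garbage collector's "max used" statistics with gc(reset = TRUE) and read them back after the call. The helper name `peak_mb` is made up here, and the figure only covers memory managed by R.

```r
# Hypothetical helper: peak memory (in Mb) reported by R's garbage collector
# while evaluating `expr`.
peak_mb <- function(expr) {
  gc(reset = TRUE)                        # reset the "max used" statistics
  force(expr)                             # evaluate the (lazy) argument now
  m <- gc()
  i <- which(colnames(m) == "max used")   # its Mb figure is the next column
  sum(m[, i + 1])                         # Ncells + Vcells, in Mb
}

peak_small <- peak_mb(numeric(1e3))
peak_big   <- peak_mb(numeric(5e6))       # roughly 40 Mb of doubles
```

With the benchmark objects above, one could then compare something like peak_mb(ARUN.DT(copy(DT))) against peak_mb(RICARDO.DT(copy(DT))), using copy() so that setkey does not modify DT between runs.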

You can try allocating the memory for all the rows first, and then populating them iteratively.
e.g.:

  library(data.table)

  # make sure `sample_data$multiple` is an integer
  sample_data$multiple <- as.integer(sample_data$multiple)

  # create data.table
  S <- data.table(sample_data, key='yr')

  # optionally, drop original data.frame if not needed
  rm(sample_data)

  ## Allocate the memory first
  ## (built from S, since sample_data may have been removed above)
  newDT <- data.table(yr = rep(S$yr, S$multiple), key="yr")
  newDT[, tokenReps := as.character(NA)]

  # Add the rep'd tokens into newDT, recycling each year's tokens
  # explicitly to the group size with rep(..., length.out = .N)
  newDT[, tokenReps := rep(S[.(y)][, token], length.out = .N), by=list(y=yr)]

Two notes:

(1) sample_data$multiple is currently a character and is thus coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.

(2) I used the following to determine the number of rows needed per year:

S[, list(rows=length(token) * unique(multiple)), by=yr] 
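Both notes can be checked on a small stand-in for the sample data. This is a sketch only: `toy_df`, `S2` and their values are made up for illustration.

```r
library(data.table)

# stand-in for sample_data; all columns are character, as cbind() produces
toy_df <- data.frame(yr = c("1999", "1999", "2000"),
                     token = c("A", "B", "C"),
                     multiple = c("2", "2", "3"),
                     stringsAsFactors = FALSE)
class(toy_df$multiple)   # "character"
toy_df$multiple <- as.integer(toy_df$multiple)

S2 <- data.table(toy_df, key = "yr")

# rows needed per year: number of tokens times the multiple
rows_needed <- S2[, list(rows = length(token) * unique(multiple)), by = yr]
# 1999: 2 tokens * 2 = 4 rows; 2000: 1 token * 3 = 3 rows

# the preallocated table built with rep() has the same total number of rows
prealloc <- data.table(yr = rep(S2$yr, S2$multiple))
nrow(prealloc) == sum(rows_needed$rows)
```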
