Efficient ways to reshape huge data from long to wide format - similar to dcast
This question pertains to creating "wide" tables, similar to the tables you could create using dcast from reshape2. I know this has been discussed many times before, but my question is about how to make the process more efficient. I have provided several examples below which might make the question seem lengthy, but most of it is just test code for benchmarking.
Starting with a simple example,
> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))
> z
col1 col2 col3 col4
1: 1 10 5.0 a # col1 = 1, col2 = 10
2: 1 10 2.0 a # col1 = 1, col2 = 10
3: 2 20 2.3 b
4: 3 20 2.4 c
5: 4 30 100.0 a
We need to create a "wide" table that has the values of the col4 column as column names and the value of sum(col3) for each combination of col1 and col2.
> ulist = unique(z$col4) # These will be the additional column names
# Create long table with sum
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']
# Pivot the long table
> z2 <- z2[,as.list((sumcol[match(ulist,col4)])), by=c("col1","col2")]
# Add column names
> setnames(z2,c("col1","col2",ulist))
> z2
col1 col2 a b c
1: 1 10 7 NA NA # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2: 2 20 NA 2.3 NA
3: 3 20 NA NA 2.4
4: 4 30 100 NA NA
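For comparison, the aggregate-and-pivot steps above can be expressed in a single call with reshape2's dcast (a sketch; `fill = NA_real_` is an assumption here so that empty cells come out as NA, matching the output above, rather than sum's default of 0):

```r
library(data.table)
library(reshape2)

z <- data.table(col1 = c(1, 1, 2, 3, 4),
                col2 = c(10, 10, 20, 20, 30),
                col3 = c(5, 2, 2.3, 2.4, 100),
                col4 = c("a", "a", "b", "c", "a"))

# Aggregate col3 with sum while spreading col4 values into columns
w <- dcast(z, col1 + col2 ~ col4, value.var = "col3",
           fun.aggregate = sum, fill = NA_real_)
```

This produces the same four-row table as the match()-based pivot, with NA in the cells that have no corresponding rows.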
The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run on very large tables (unless you are fine with waiting x hours, maybe).
This, I believe, is likely related to the fact that the pivoted/wide table is much larger than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.
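The blow-up is easy to quantify: with k unique pivot values, the wide table carries k value cells per output row whether or not they are populated. A quick check on the small example above (uniqueN assumes a reasonably recent data.table):

```r
library(data.table)

z <- data.table(col1 = c(1, 1, 2, 3, 4), col2 = c(10, 10, 20, 20, 30),
                col3 = c(5, 2, 2.3, 2.4, 100), col4 = c("a", "a", "b", "c", "a"))

k      <- uniqueN(z$col4)                            # 3 unique pivot values
groups <- nrow(unique(z[, list(col1, col2)]))        # 4 distinct (col1, col2) pairs
cells  <- groups * k                                 # 12 value cells in the wide table
filled <- nrow(unique(z[, list(col1, col2, col4)]))  # only 4 of them are populated
# the remaining 8 cells are NA padding
```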
My original table has ~500 million rows and about 20 unique values. I have attempted to run the above using only 5 million rows, and it takes forever in R (too long to wait for it to complete).
For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute using production RDBMS systems running multithreaded. It completes in about 8 seconds on a single core using KDB+/Q ( http://www.kx.com ). It might not be a fair comparison, but it gives a sense that it is possible to do these operations much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells and is still much faster than anything else I have tried.
What I need, however, is an R solution :) and so far, I haven't found an efficient way to perform similar operations.
If you have had experience with any alternative / more optimal solution, I'd be interested in hearing about it. Sample code is provided below. You can vary the value of n to simulate the results. The number of unique values for the pivot column (column c3) has been fixed at 25.
n = 100 # Increase this to benchmark
z <- data.table(c1=sample(1:10000,n,replace=T),
c2=sample(1:100000,n,replace=T),
c3=sample(1:25,n,replace=T),
price=runif(n)*10)
c3.unique <- 1:25
z <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list((sumprice[match(c3.unique,c3)])), by='c1,c2']
setnames(z, c("c1","c2",c3.unique))
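One way to compare approaches on your own machine is to wrap each pivot in system.time (a sketch; n is kept small here so it runs quickly, and exact timings will of course differ by hardware):

```r
library(data.table)
library(reshape2)

set.seed(1)
n <- 1e5  # increase toward 5e6 to stress-test
z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# match()-based pivot from the question
t_match <- system.time(
  w1 <- z[, list(sumprice = sum(price)), by = "c1,c2,c3"][
          , as.list(sumprice[match(1:25, c3)]), by = "c1,c2"]
)

# dcast-based pivot
t_dcast <- system.time(
  w2 <- dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)
)

t_match["elapsed"]
t_dcast["elapsed"]
```

Both pivots produce one row per distinct (c1, c2) pair, so their row counts should agree.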
Thanks,
For n=1e6, the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:
library(reshape2)
dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)
# or with 1.8.11
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
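In current data.table versions (1.9.6+), dcast is a generic with a data.table method, so the 1.8.11 call above can be written without the dcast.data.table name (a sketch assuming a recent data.table; `fill = NA` is used here so that empty combinations come out as NA rather than sum's default of 0):

```r
library(data.table)

set.seed(1)
n <- 1000
z <- data.table(c1 = sample(1:100, n, replace = TRUE),
                c2 = sample(1:100, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# Aggregation and pivot in one call, dispatched to the data.table method
w <- dcast(z, c1 + c2 ~ c3, value.var = "price",
           fun.aggregate = sum, fill = NA)
```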