
Efficient Combination and Operating on Large Data Frames

I have 2 relatively large data frames in R. I'm attempting to merge them / find all combinations as efficiently as possible. The resulting df turns out to be huge (its length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.

Working example: assume I'm working with some data frame that looks similar to USArrests:

library('ff')
library('ffbase')


myNames <- USArrests

myNames$States <- rownames(myNames)
rownames(myNames) <- NULL

Now I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their row names later.

myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))


# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)

For example, my first set of states in myDF1 is myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey:

# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)

startTime <- Sys.time()


# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])


# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)


# Sort one DF (to later combine with the other):
myVector2  <- ffsorted(myVector2)

# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)

1) Is there a faster way to sort this?
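One possible way to sidestep the sort, sketched here in base R on small in-memory vectors rather than on the ff objects: sorting a vector that was repeated n times gives the same result as sorting the short vector once and then repeating each element n times. The names key2 and n1 are illustrative stand-ins for myDF2$key and dim(myDF1)[1]; whether ffbase provides an "each"-style repeat for ff vectors is not verified here.

```r
# Small in-memory stand-ins (hypothetical values, not the ff objects above)
key2 <- c(3L, 1L, 2L)   # plays the role of myDF2$key
n1   <- 4L              # plays the role of dim(myDF1)[1]

# Approach from the question: repeat the whole vector, then sort the long result
slow <- sort(rep.int(key2, n1))

# Equivalent: sort the short vector once, then repeat each element n1 times
fast <- rep(sort(key2), each = n1)

print(identical(slow, fast))
```

The second form only ever sorts a vector of length dim(myDF2)[1], so the cost of sorting no longer grows with the size of the full cross product.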

# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)

# Very fast:
print(Sys.time() - startTime)

2) Is there an alternative to this type of combination (without using RAM)?

Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:

# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this

# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]

# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)

3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to accomplish this most efficiently?
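A minimal sketch of one way the rowSum could be obtained without looking up the original rows for every combination: the sum over a combined set of states is the sum over the myDF1 part plus the sum over the myDF2 part, so each part can be summed once per input row and then added per pair. The sketch uses small base R matrices (df1, df2 are hypothetical stand-ins for myDF1 and myDF2) and has not been tested against ff objects.

```r
# Small in-memory stand-ins for myDF1 / myDF2 (row indices into USArrests)
pop <- USArrests$UrbanPop
df1 <- matrix(c(1L, 2L, 3L,
                4L, 5L, 6L), ncol = 3, byrow = TRUE)   # 2 sets of 3 states
df2 <- matrix(c(7L, 8L,
                9L, 10L,
                11L, 12L), ncol = 2, byrow = TRUE)     # 3 sets of 2 states

# Partial sums, computed once per input row (not once per combination)
s1 <- rowSums(matrix(pop[as.vector(df1)], nrow = nrow(df1)))
s2 <- rowSums(matrix(pop[as.vector(df2)], nrow = nrow(df2)))

# All pairs of input rows; the total for a pair is just s1[i] + s2[j]
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))
pairs$total <- s1[pairs$i] + s2[pairs$j]

# Cross-check one pair against the direct lookup used in the question
direct <- sum(pop[c(df1[2, ], df2[3, ])])
print(pairs$total[pairs$i == 2 & pairs$j == 3] == direct)
```

Only s1 and s2 need to be held in memory (one value per input row), and the per-pair totals could then be filled in chunk by chunk for the large combined table.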

Thanks!

It's not clear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly combination of 2 ff vectors that gives all combinations, you can use expand.ffgrid from ffbase. The following will generate your ffdf, with dimensions of 160 million rows x 2 columns, in a few seconds.

require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
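To illustrate the row layout of such a grid, here is a small base R sketch using expand.grid. It assumes that expand.ffgrid orders its rows the same way as base expand.grid (first argument varying fastest) and that the keys are simply 1..n; both are assumptions, and n1 and n2 are made-up sizes. Under those assumptions, the pair of keys at any row r can also be computed with integer arithmetic, without storing the grid at all.

```r
# Hypothetical sizes standing in for the number of keys in myDF1 and myDF2
n1 <- 3L
n2 <- 4L

# Base R analogue of the grid: the first argument varies fastest
grid <- expand.grid(k1 = seq_len(n1), k2 = seq_len(n2))

# The same pairs recovered from the row number alone
r  <- seq_len(n1 * n2)
k1 <- (r - 1L) %% n1 + 1L    # cycles 1..n1
k2 <- (r - 1L) %/% n1 + 1L   # advances after every n1 rows

print(all(grid$k1 == k1) && all(grid$k2 == k2))
```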
