
full_join() two data frames in segments/batches in R

I have two data frames that I am trying to merge.

df1 has 20015 rows and 7 variables. df2 has 8534664 rows and 29 variables.

When I run full_join(df1, df2, by = "KEY") I get Error: cannot allocate vector of size 891.2 Mb. I set memory.limit(1000000) and I still get the same error. I ran full_join() while watching the CPU usage graph in the Windows Task Manager, and it increases exponentially. I have also called gc() throughout my code.
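One possible cause of the allocation error (an assumption, since the question does not say whether KEY is unique) is that KEY values repeat in both tables, so every pair of matching rows becomes an output row. A minimal sketch for estimating the size of the result before running the join, using hypothetical toy stand-ins for df1 and df2:

```r
library(dplyr)

# Hypothetical toy stand-ins for df1 and df2; only the KEY column matters here.
df1 <- data.frame(KEY = c(1, 1, 2, 3))
df2 <- data.frame(KEY = c(1, 1, 1, 2, 4))

# Count how often each KEY value occurs in each table.
counts1 <- count(df1, KEY, name = "n1")
counts2 <- count(df2, KEY, name = "n2")

# A key present in both tables contributes n1 * n2 rows to a full join;
# a key present in only one table contributes its own count.
sizes <- full_join(counts1, counts2, by = "KEY") %>%
  mutate(rows = as.numeric(coalesce(n1, 1L)) * coalesce(n2, 1L))

expected_rows <- sum(sizes$rows)
expected_rows
```

If expected_rows is far larger than nrow(df1) + nrow(df2), the join is many-to-many and the result itself may not fit in memory, whatever function performs it.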

My question is: is there a function that can join the first 1,000,000 rows, take a break, then join the next 1,000,000 rows, and so on until all rows have been joined?

Is there a function to run full_join() in batches?

This just reports how long the join takes with dplyr's full_join() and with data.table's merge() on a 64-bit Windows system (Intel ~3.5 GHz, 120 GB RAM). I hope it helps at least as a reference for your case.

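library(data.table)
library(dplyr)

# Simulated data with the row counts from the question.
# KEY takes only 800 distinct values, so the join is many-to-many.
df1 <- data.table(KEY = sample(1:800, 20015, replace = TRUE),
                  matrix(rnorm(20015 * 7), 20015, 7))       # ~1.1 MB
df2 <- data.table(KEY = sample(1:800, 8534664, replace = TRUE),
                  matrix(rnorm(8534664 * 29), 8534664, 29)) # ~1.9 GB

# With dplyr full_join.
tick <- Sys.time()
df_join <- full_join(df1, df2, by = "KEY") # ~58.1 GB in memory
tock <- Sys.time() - tick                  # ~1.85 min

# With data.table merge. Note that merge() defaults to an inner join;
# pass all = TRUE for a full outer join.
tick <- Sys.time()
df_join <- merge(df1, df2, by = "KEY", allow.cartesian = TRUE) # ~58.1 GB in memory
tock <- Sys.time() - tick                                      # ~5.75 min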
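As far as I know, dplyr has no built-in batched full_join(), but the batching asked about in the question can be sketched by hand. The helper below is hypothetical (full_join_in_batches, batch_size and out_dir are names made up for illustration) and assumes the larger table has at least one row. It joins one slice of the larger table at a time with right_join(), saves each piece to disk, and finally adds the rows of the smaller table that have no match anywhere in the larger one.

```r
library(dplyr)

# Hypothetical helper: full-join `small` to `big` in batches of `batch_size`
# rows of `big`, writing each piece to its own .rds file in `out_dir`.
full_join_in_batches <- function(small, big, by, batch_size, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  starts <- seq(1, nrow(big), by = batch_size)
  files <- character(0)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + batch_size - 1, nrow(big))
    # right_join keeps every row of this batch of `big`, matched or not.
    piece <- right_join(small, slice(big, rows), by = by)
    files[i] <- file.path(out_dir, sprintf("part_%04d.rds", i))
    saveRDS(piece, files[i])
    rm(piece)
    gc()
  }
  # Rows of `small` with no match in `big` complete the full join. Joining
  # them to a zero-row copy of `big` gives them the same columns as the pieces.
  unmatched <- full_join(anti_join(small, big, by = by), head(big, 0), by = by)
  files <- c(files, file.path(out_dir, "part_unmatched.rds"))
  saveRDS(unmatched, files[length(files)])
  files
}

# Toy usage with made-up data.
small <- data.frame(KEY = c(1, 2, 3), a = c("x", "y", "z"))
big <- data.frame(KEY = c(1, 1, 2, 4, 4), b = 1:5)
files <- full_join_in_batches(small, big, by = "KEY", batch_size = 2,
                              out_dir = file.path(tempdir(), "join_parts"))
combined <- bind_rows(lapply(files, readRDS))
```

The combined result has the same rows as full_join(small, big, by = "KEY"), though not necessarily in the same order. Batching only lowers peak memory if the pieces are processed or stored separately; binding them all back together needs as much memory as the single join, which was about 58 GB in the simulation above.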
