
Sort by code column multiple data.tables in R into same number of data.tables without binding the data.tables (due to memory limits)

I have many CSVs containing a huge amount of data that is unsorted by code across all CSVs in the set. I'd like to sort the codes across the whole set, saving groups of codes to CSVs together and keeping the same number of CSVs as before, when they were unsorted. I can't bind them together, sort, and split (as I usually would) because I have to keep the CSVs separated due to memory limits. My real dataset is billions of lines split across hundreds of CSVs like this.

For example, suppose that after calling fread on each file I have data.tables like the examples below:

Reproducible data:

### In practice I would fread() each of these; they are built here for reproducibility
library(data.table)

data1 <- data.table(code=rep(c(1:2000),times=500),
                    data1=rep(c(10001:12000),times=500), 
                    data2=rep(c(20001:22000),times=500))
data2 <- data.table(code=rep(c(1:2000),times=500),
                    data1=rep(c(10001:12000),times=500), 
                    data2=rep(c(20001:22000),times=500))
data3 <- data.table(code=rep(c(1:2000),times=500),
                    data1=rep(c(10001:12000),times=500), 
                    data2=rep(c(20001:22000),times=500))
data4 <- data.table(code=rep(c(1:2000),times=500),
                    data1=rep(c(10001:12000),times=500), 
                    data2=rep(c(20001:22000),times=500))

I'd like to sort by code across all of the data.tables (in reality their number varies) and save the result as the same number of CSVs.

Below is an example of the above data in the format I'd like. The original data.tables each contain codes 1-2000; here the codes are split so that codes 1:500 are in desired1, codes 501:1000 are in desired2, codes 1001:1500 are in desired3, and codes 1501:2000 are in desired4.
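
The split described above can be sketched as a mapping from each code to an output file number. This is only an illustration: the helper name `code_to_bin` and its parameters are hypothetical, and it assumes the minimum and maximum codes are known in advance and that equal-width code ranges are acceptable.

```r
# Hypothetical helper (not from the original post): assign each code to one of
# n_out output files by cutting the code range into equal-width bins.
code_to_bin <- function(code, n_out, min_code, max_code) {
  width <- ceiling((max_code - min_code + 1) / n_out)
  (code - min_code) %/% width + 1
}

# With codes 1-2000 and 4 output files, each bin covers 500 consecutive codes
bins <- code_to_bin(1:2000, n_out = 4, min_code = 1, max_code = 2000)
table(bins)
```

Equal-width ranges only give equally sized files when codes are evenly distributed, as they are in the example data; skewed real data would need different break points.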

Reproducible desired data:

### I'd use fwrite() to save each one of these to a CSV file

desired1 <- data.table(code=rep(c(1:500),times=2000),
                       data1=rep(c(10001:10500),times=2000), 
                       data2=rep(c(20001:20500),times=2000))
desired2 <- data.table(code=rep(c(501:1000),times=2000),
                       data1=rep(c(10501:11000),times=2000), 
                       data2=rep(c(20501:21000),times=2000))
desired3 <- data.table(code=rep(c(1001:1500),times=2000),
                       data1=rep(c(11001:11500),times=2000), 
                       data2=rep(c(21001:21500),times=2000))
desired4 <- data.table(code=rep(c(1501:2000),times=2000),
                       data1=rep(c(11501:12000),times=2000), 
                       data2=rep(c(21501:22000),times=2000))

In reality I have 500 or more CSVs. What is the fastest way to sort them and save all rows with the same code to the same CSV, while still splitting across the same number of CSVs as the original unsorted files? Thanks in advance!

A for loop that filters each data.table to one code range and rbinds the pieces sequentially would be memory efficient:

out <- data1[code %in% 1:500]
for(i in 2:4) out <- rbind(out, get(paste0('data', i))[code %in% 1:500])
identical(out, desired1) 
#[1] TRUE 
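
The same loop could be applied to files on disk rather than objects in memory. The following is a minimal sketch, not a tested solution for billions of rows: the function name `sort_into_files`, its arguments, and the output file names are all assumptions made for illustration. It reads one input CSV at a time, keeps only the rows for the current code range, and writes one sorted CSV per range.

```r
library(data.table)

# Assumptions (not from the original post): `in_files` lists the unsorted CSVs,
# `out_dir` is where outputs go, and `breaks` holds the upper code bound of
# each output file.
sort_into_files <- function(in_files, out_dir, breaks) {
  lower <- c(-Inf, head(breaks, -1))
  for (k in seq_along(breaks)) {
    # Read one input file at a time and keep just the rows for range k
    parts <- lapply(in_files, function(f) {
      fread(f)[code > lower[k] & code <= breaks[k]]
    })
    out <- rbindlist(parts)
    setorder(out, code)
    fwrite(out, file.path(out_dir, sprintf("sorted%d.csv", k)))
  }
  invisible(NULL)
}

# Small demonstration with four temporary CSVs
tmp <- tempfile("csvs")
dir.create(tmp)
in_files <- file.path(tmp, paste0("data", 1:4, ".csv"))
for (f in in_files) {
  fwrite(data.table(code  = rep(1:2000, times = 5),
                    data1 = rep(10001:12000, times = 5)), f)
}
sort_into_files(in_files, tmp, breaks = c(500, 1000, 1500, 2000))
```

Peak memory is roughly one input file plus the rows of one output file, but every input is re-read once per output file, so with hundreds of CSVs the read cost grows quadratically.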
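
# Relabel codes: assumes every code occurs equally often, so unique(a) is a single count
mm = function(x){
  a = table(x)   # occurrences of each code
  rep(1:unique(a), length(a))
}

# Overwrite the code column of each data.table by reference, adding an offset per table
Map(function(x, y) set(x, j = "code", value = mm(x[, code]) + y),
    mget(ls(pattern = "data")), c(0, 500, 1000, 1500))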

$data4
         code data1 data2
      1: 1501 10001 20001
      2: 1502 10002 20002
      3: 1503 10003 20003
      4: 1504 10004 20004
      5: 1505 10005 20005
     ---                 
 999996: 1996 11996 21996
 999997: 1997 11997 21997
 999998: 1998 11998 21998
 999999: 1999 11999 21999
1000000: 2000 12000 22000

This changes the original data because set() modifies by reference; i.e. if you print data2 you will see that it has changed. If you do not want this behavior, consider using the function copy, i.e. set(copy(x),....
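
A small illustration of the difference, using a toy table rather than the data above (the names `dt` and `dt_copy` are made up for this sketch):

```r
library(data.table)

dt <- data.table(code = 1:3, data1 = 10001:10003)

# set() on a copy leaves `dt` untouched; set() returns the modified table
dt_copy <- set(copy(dt), j = "code", value = dt$code + 500L)

# set() directly on `dt` overwrites its code column in place
set(dt, j = "code", value = dt$code + 1000L)
```

After these calls `dt_copy` holds the codes shifted by 500, while `dt` itself has been changed by the second call only.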
