Rbind with new columns and data.table

I need to append many large tables to an existing table, so I use rbind with the excellent data.table package. However, some of the later tables have more columns than the original one, and those extra columns need to be kept. Is there an equivalent of rbind.fill for data.table?

library(data.table)

aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))

dt.11 <- rbind(dt.1, dt.1)  # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2)  # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2)  # What I need, doesn't work either

I need to start rbinding before I have all the tables, so there is no way to know what future new columns will be called. Missing data can be filled with NA.

Since v1.9.2, data.table's rbind function has a fill argument. From the ?rbind.data.table documentation:

If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

Thus you can do (prior to approx. v1.9.6):

data.table:::rbind.data.table(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

UPDATE for v1.9.6:

This now works directly:

rbind(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5
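Since the question is about appending tables as they arrive, here is a minimal sketch of that loop using the fill argument. The incoming list and its contents are made up for illustration; in practice each table would be read or computed when it becomes available.

```r
library(data.table)

master <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4))

# Hypothetical stream of incoming tables (illustrative data only)
incoming <- list(
  data.table(aa = c(1, 2, 3), bb = c(2, 3, 4), cc = c(3, 4, 5)),
  data.table(bb = 9, dd = "x")
)

for (new_table in incoming) {
  # Columns unknown to either side are added and padded with NA
  master <- rbind(master, new_table, fill = TRUE)
}

master
sapply(master, class)
```

Each call copies the whole master table, so for a very long stream it may be cheaper to collect the pieces in a list and bind them once at the end.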

Here is an approach that adds the missing columns to each table before binding:

rbind.missing <- function(A, B) { 

  cols.A <- names(A)
  cols.B <- names(B)

  missing.A <- setdiff(cols.B,cols.A)
  # check and define missing columns in A
  if(length(missing.A) > 0L){
   class.missing.A <- lapply(B[,missing.A,with = FALSE], class)
   nas.A <- lapply(class.missing.A, as, object = NA)
   A[,c(missing.A) := nas.A]
  }
  # check and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if(length(missing.B) > 0L){
    class.missing.B <- lapply(A[,missing.B,with = FALSE], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[,c(missing.B) := nas.B]
  }
  # reorder so they are the same
  setcolorder(B, names(A))
  rbind(A, B)

}

rbind.missing(dt.1,dt.2)

##    aa bb cc
## 1:  1  2 NA
## 2:  2  3 NA
## 3:  3  4 NA
## 4:  1  2  3
## 5:  2  3  4
## 6:  3  4  5

This will not be efficient for many or large data.tables, as it only works on two at a time.
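If all the tables are available at once, one way around the two-at-a-time limitation is to put them in a list and bind them in a single call. This is a sketch that swaps in rbindlist() with fill = TRUE (available in recent data.table versions) rather than the helper above; the sample tables are illustrative.

```r
library(data.table)

tables <- list(
  data.table(a = 1:3, b = 1:3, c = 1:3),
  data.table(b = 1:3, c = 1:3, d = 1:3, m = LETTERS[1:3]),
  data.table(a = 6:9, b = 6:9, c = 6:9)
)

# use.names = TRUE matches columns by name; fill = TRUE pads the
# columns a table lacks with NA
combined <- rbindlist(tables, use.names = TRUE, fill = TRUE)

combined
```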

The basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.

As @mnel pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.

One solution is to force all columns to a single type (i.e. as.numeric(NA)), but that is too restrictive.

Instead, we need to check the class of each new column. We can then use as(NA, cc) (cc being the class) as the value assigned to the new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS of := to assign.
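To see why the class matters, here is a small illustration of the difference between a bare NA and a typed one. The column names and the wanted vector are made up for the example.

```r
library(methods)     # provides as(); attached by default in most sessions
library(data.table)

dt <- data.table(x = 1:3)

# A bare NA is logical, so the new column is logical as well
dt[, plain := NA]

# as(NA, cls) yields an NA of the requested class
wanted <- c(n = "numeric", s = "character", i = "integer")
dt[, (names(wanted)) := lapply(wanted, function(cls) as(NA, cls))]

sapply(dt, class)
```

This covers the basic atomic classes; classes such as factor or Date may need their own handling.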

We can then wrap this in a function and use S3 methods so that we can simply call

rbindFill(A, B)

Below is the function.

rbindFill.data.table <- function(master, newTable)  {
# Append newTable to master

    # assign to Master
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(newTable), names(master))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
           # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))

            # assign to each column value of NA with appropriate class 
            master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }

    # assign to newTable
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(master), names(newTable))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
        # identify class of each
        colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))

        # assign to each column value of NA with appropriate class 
        newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
      }

    # reorder columns to avoid warning about ordering
    #-----------------#
      # both tables now have the same set of columns, so master's order can be used directly
      setcolorder(newTable, names(master))

    # rbind them! 
    #-----------------#
      rbind(master, newTable)
  }

  # implement generic function
  rbindFill <- function(x, y, ...) UseMethod("rbindFill")


Example Usage:

    # Sample Data: 
    #--------------------------------------------------#
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    #--------------------------------------------------#

    # Three successive calls to rbindFill
    master <- rbindFill(A, B)
    master <- rbindFill(master, A2)
    master <- rbindFill(master, C)

    # Results:
    master
    #      a  b c  d  m     n     f
    #  1:  1  1 1 NA NA    NA    NA
    #  2:  2  2 2 NA NA    NA    NA
    #  3:  3  3 3 NA NA    NA    NA
    #  4: NA  1 1  1  A    NA    NA
    #  5: NA  2 2  2  B    NA    NA
    #  6: NA  3 3  3  C    NA    NA
    #  7:  6  6 6 NA NA    NA    NA
    #  8:  7  7 7 NA NA    NA    NA
    #  9:  8  8 8 NA NA    NA    NA
    # 10:  9  9 9 NA NA    NA    NA
    # 11: NA NA 7 NA NA  0.86  TRUE
    # 12: NA NA 8 NA NA -1.15 FALSE
    # 13: NA NA 9 NA NA  1.10  TRUE

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0], which has the structure of the second data.table. This avoids the risk of introducing bugs in user-written functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

Also, using rbindlist() should be much faster when dealing with data.tables.

Define the tables (same as mnel's code above):

library(data.table)
A  <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)

Insert the missing variables in table A (note the use of A2[0]):

A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)

Insert the missing columns in table A2:

A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)

Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed; I am not sure whether rbindlist() binds by column name or by column position):

setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL

Repeat for the other tables. It would probably be better to put this into a function than to repeat it by hand.

DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))

DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL

The result looks the same as mnel's output (except for the random numbers and the column order).
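The repeated steps could be wrapped up roughly as follows. This is only a sketch: pad_like and bind_via_merge are hypothetical names, and it assumes the two tables share at least one column (otherwise there is nothing to merge by). Note that merge() sorts rows by the shared columns by default, so row order within each table may change.

```r
library(data.table)

# Hypothetical helper: add to x the columns it lacks, with the right
# types, by merging x with a zero-row copy of y
pad_like <- function(x, y) {
  merge(x, y[0], by = intersect(names(x), names(y)), all = TRUE)
}

# Hypothetical helper: pad both tables, align column order, then bind
bind_via_merge <- function(x, y) {
  x_padded <- pad_like(x, y)
  y_padded <- pad_like(y, x)
  setcolorder(y_padded, names(x_padded))
  rbindlist(list(x_padded, y_padded))
}

A <- data.table(a = 1:3, b = 1:3, c = 1:3)
B <- data.table(b = 1:3, c = 1:3, d = 1:3, m = LETTERS[1:3])

res <- bind_via_merge(A, B)
res
```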

PS1: The original author does not say what to do if there are matching variables: do we really want an rbind(), or are we thinking of a merge()?

PS2: (Since I do not have enough reputation to comment) The gist of the question seems to be a duplicate of this question. It is also relevant for benchmarking data.table against plyr on large datasets.

The answers are great, but it looks like some other functions are suggested here, such as plyr::rbind.fill and gtools::smartbind, which seem to work well for me.
