R: combining factor levels from multiple data.frames

Question

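How can I combine the factor levels of two empty data frames?

I have a large dataset that is split across several separate files. I need a data.frame that carries every possible level of each factor column, but I cannot load all the parts at once, only one part at a time.

Is there a way to do something like this: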

data_structure = NULL
for (chunk_i in chunks){
    # load chunk_i into 'data' (loading code omitted)

    if(is.null(data_structure)){
        data_structure = data
    } else {
        # at this line factor levels will NOT be combined as I expect
        # but instead factor levels from 'data' will be stored to 'data_structure'
        data_structure = rbind(data_structure, data)
    }
    rm(data)

    # empty data frame, since I can't keep all data in memory
    # I want to keep only metadata, like factor levels
    data_structure = data_structure[0, ]
}

I need this data_structure so that I can later convert the factors into binary columns, like this:

result_i = model.matrix(~ . + 0, data=data_i, contrasts.arg = 
              lapply(data_structure, contrasts, contrasts=FALSE))

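If the factor levels are collected from every part of the data, I can be sure that result_i will have exactly the same binary columns as every other part, even when data_i happens to contain fewer levels in some columns.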
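One detail the snippet above leaves implicit: the contrast matrices have one row per level, so each factor in data_i presumably has to be re-created with the full level set before calling model.matrix. The following is a minimal sketch of that step, assuming the levels have already been collected into a named list. The names all_levels, color and size are made up for illustration.

```r
# Hypothetical level table collected from all chunks (assumed for illustration)
all_levels <- list(color = c("red", "green", "blue"),
                   size  = c("S", "M", "L"))

# A chunk that happens to contain only some of the levels
data_i <- data.frame(
  color = factor(c("red", "red", "blue")),
  size  = factor(c("M", "S", "M")),
  x     = c(1.5, 2.0, 0.3)
)

# Re-create each factor column with the full level set
for (name in names(all_levels)) {
  data_i[[name]] <- factor(data_i[[name]], levels = all_levels[[name]])
}

# Full indicator (one-hot) contrasts, built for the factor columns only
contrasts_full <- lapply(data_i[names(all_levels)], contrasts, contrasts = FALSE)

result_i <- model.matrix(~ . + 0, data = data_i, contrasts.arg = contrasts_full)
colnames(result_i)
```

Levels that do not occur in the chunk (here "green" and "L") still get a column, filled with zeros, so every chunk yields the same column layout.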

UPDATE

For now I am using the following solution:

all_levels = list()
for_each_chunk(function(data) {
    # lapply (not sapply) so the result is always a list, never a simplified matrix
    data_levels = Filter(Negate(is.null), lapply(data, levels))
    factor_names = unique(c(names(all_levels), names(data_levels)))
    lapply(factor_names, FUN=function(name){ 
        all_levels[[name]] <<- unique(c(all_levels[[name]], data_levels[[name]]))
    })
})

It is not as elegant as I would like, but I have not found a better option yet.
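The same bookkeeping can be written without `<<-` by merging named lists of levels one chunk at a time. This is only a sketch: the helper names factor_levels and merge_levels and the in-memory chunks list are invented stand-ins for the real loading code.

```r
# Return the levels of every factor column of a data frame as a named list
factor_levels <- function(data) {
  lapply(Filter(is.factor, data), levels)
}

# Merge two named lists of levels, taking the union per column name
merge_levels <- function(a, b) {
  names_all <- union(names(a), names(b))
  setNames(lapply(names_all, function(name) union(a[[name]], b[[name]])),
           names_all)
}

# Stand-in for chunks that would normally be read from separate files
chunks <- list(
  data.frame(f = factor(c("a", "b")), g = factor(c("u", "u")), x = 1:2),
  data.frame(f = factor(c("b", "c")), g = factor(c("v", "u")), x = 3:4)
)

all_levels <- list()
for (chunk in chunks) {
  # only the small list of levels survives each iteration
  all_levels <- merge_levels(all_levels, factor_levels(chunk))
}
str(all_levels)
```

Because only the level vectors are kept between iterations, memory use does not grow with the number of rows already processed.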

Answer 1

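The solution I am proposing may be a silly one. Why not take a stratified sample from each chunk separately and then read those samples into a single data frame? That way, I think, all the levels will be stored in the metadata. You can do stratified sampling in R with the sampling package, or use this function, which I picked up from GitHub some time ago: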

stratified <- function(df, group, size, select = NULL, 
                       replace = FALSE, bothSets = FALSE) {
  # Optionally restrict df to rows matching every entry of 'select'
  if (!is.null(select)) {
    if (is.null(names(select))) stop("'select' must be a named list")
    if (!all(names(select) %in% names(df)))
      stop("Please verify your 'select' argument")
    temp <- sapply(names(select),
                   function(x) df[[x]] %in% select[[x]])
    df <- df[rowSums(temp) == length(select), ]
  }
  df.interaction <- interaction(df[group], drop = TRUE)
  df.table <- table(df.interaction)
  df.split <- split(df, df.interaction)
  if (length(size) > 1) {
    if (length(size) != length(df.split))
      stop("Number of groups is ", length(df.split),
           " but number of sizes supplied is ", length(size))
    if (is.null(names(size))) {
      n <- setNames(size, names(df.split))
      message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
              paste(n, collapse = ", "), "),\n.Names = c(",
              paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
    } else {
      # Reorder the named sizes to match the groups, or fail if names differ
      if (all(names(size) %in% names(df.split))) {
        n <- size[names(df.split)]
      } else {
        stop("Named vector supplied with names ",
             paste(names(size), collapse = ", "),
             "\n but the names for the group levels are ",
             paste(names(df.split), collapse = ", "))
      }
    }
  } else if (size < 1) {
    n <- round(df.table * size, digits = 0)
  } else if (size >= 1) {
    if (all(df.table >= size) || isTRUE(replace)) {
      n <- setNames(rep(size, length.out = length(df.split)),
                    names(df.split))
    } else {
      message(
        "Some groups\n---",
        paste(names(df.table[df.table < size]), collapse = ", "),
        "---\ncontain fewer observations",
        " than desired number of samples.\n",
        "All observations have been returned from those groups.")
      n <- c(sapply(df.table[df.table >= size], function(x) x = size),
             df.table[df.table < size])
    }
  }
  temp <- lapply(
    names(df.split),
    function(x) df.split[[x]][sample(df.table[x],
                                     n[x], replace = replace), ])
  set1 <- do.call("rbind", temp)

  if (isTRUE(bothSets)) {
    set2 <- df[!rownames(df) %in% rownames(set1), ]
    list(SET1 = set1, SET2 = set2)
  } else {
    set1
  }
}
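
To make the idea concrete without depending on the function above, here is a base-R sketch of the same approach: keep a few rows per chunk that cover every observed level, then rbind the small non-empty pieces, which merges their factor levels. The helper one_row_per_level and the sample chunks are hypothetical, and only levels that actually occur in the data are captured.

```r
# Keep the first row in which each observed value of each factor column occurs
one_row_per_level <- function(data) {
  factor_cols <- names(Filter(is.factor, data))
  keep <- unique(unlist(lapply(factor_cols, function(col) {
    which(!duplicated(data[[col]]))
  })))
  data[sort(as.integer(keep)), , drop = FALSE]
}

# Stand-in for chunks that would normally be loaded one at a time
chunks <- list(
  data.frame(f = factor(c("a", "b", "a", "b")),
             g = factor(c("u", "u", "u", "u")), x = 1:4),
  data.frame(f = factor(c("b", "c", "c", "b")),
             g = factor(c("v", "u", "v", "u")), x = 5:8)
)

samples  <- lapply(chunks, one_row_per_level)
skeleton <- do.call(rbind, samples)  # rbind merges levels of non-empty frames

levels(skeleton$f)
levels(skeleton$g)
```

The skeleton stays small (at most one row per level per factor per chunk) while still carrying the combined levels.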

Answer 1
0 2015-05-10 07:37:57
