Applying a custom function to a data.table by row returns the wrong number of values

Question

I am new to data.table and have a table of DNA genomic coordinates like this:

       chrom   pause strand coverage
    1:     1 3025794      +        1
    2:     1 3102057      +        2
    3:     1 3102058      +        2
    4:     1 3102078      +        1
    5:     1 3108840      -        1
    6:     1 3133041      +        1

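I wrote a custom function that I want to apply to every row of this table, which has roughly 2 million rows. It uses mapToTranscripts from GenomicFeatures to retrieve two related values: a string and a new coordinate. I want to add them to the table as two new columns, like this: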

       chrom   pause strand coverage       transcriptID CDS
    1:     1 3025794      +        1 ENSMUST00000116652 196
    2:     1 3102057      +        2 ENSMUST00000116652  35
    3:     1 3102058      +        2 ENSMUST00000156816 888
    4:     1 3102078      +        1 ENSMUST00000156816 883
    5:     1 3108840      -        1 ENSMUST00000156816 882
    6:     1 3133041      +        1 ENSMUST00000156816 880

The function is as follows:

    get_feature <- function(dt){

      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
      hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
      tx_id <- tx_names[as.character(seqnames(hit))] 
      cds_coordinate <- sapply(ranges(hit), '[[', 1)

      if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
        out <- list('NaN', 0)
      } else {
        out <- list(tx_id, cds_coordinate)
      }

      return(out)
    }

Then I do this:

    counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

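I get the following warnings, which indicate that the two lists returned by the function are shorter than the original table, instead of holding one new element per row: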

    Warning messages:
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

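I assumed that using the .I operator would apply the function row by row and return one value per row. I also made sure, with the if statement, that the function never returns an empty value.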

Then I tried a mock version of the function:

    get_feature <- function(dt) {

      return('I should be returned once for each row')

    }

and called it like this:

    new.table <- counts[, get_feature(.SD), by = .I]

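It produces a 1-row data.table instead of one with the original number of rows. So I concluded that my function, or the way I call it, is somehow collapsing the elements of the resulting vector. What am I doing wrong?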

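Update (with solution): As @StatLearner pointed out, the explanation given in this answer is that, as documented in ?data.table, .I is only intended for use in j (as in DT[i, j, by=]). Therefore by=.I is equivalent to by=NULL, and the correct syntax is by=1:nrow(dt), which groups by row number and applies the function row by row.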

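Unfortunately, for my particular case this is hopelessly inefficient: I measured an execution time of 20 seconds for 100 rows. For my 36-million-row dataset, it would take about 3 months to complete.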
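The estimate can be checked with simple arithmetic. The sketch below assumes that run time scales linearly with the number of rows; the variable names are illustrative only.

```r
# Measured in the question: 20 seconds for 100 rows
secs_per_row <- 20 / 100

# Size of the full dataset mentioned in the question
total_rows <- 36e6

# Linear extrapolation (an assumption, not a benchmark)
total_secs <- secs_per_row * total_rows
total_days <- total_secs / (60 * 60 * 24)

print(total_days)  # roughly 83 days, i.e. close to 3 months
```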

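In my case, I had to give up on the row-wise approach and call mapToTranscripts on the entire table at once, like this. It takes a few seconds and is evidently the intended usage.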

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      # one row per hit; positions without a hit do not appear in the result
      return(data.table('transcriptID' = tx_id,
                        'CDS_coordinate' = start(hits)))
    }

    density <- counts[, get_features(.SD)]

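I then map back to the genome using mapFromTranscripts from the GenomicFeatures package, so that I can use data.table to retrieve information from the original table, which is what I wanted to do in the first place.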
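As a rough sketch of another way to line up whole-table results with the original rows, the example below assigns by row index. It is not the approach used above. `hits_dt` is a hypothetical stand-in for the mapping output: as far as I know, the result of mapToTranscripts carries an `xHits` metadata column that indexes the input ranges, and that is what the `xHits` column imitates here. The sketch also assumes at most one hit per input row; with several hits per row, the last one would win.

```r
library(data.table)

# Toy stand-in for the question's `counts` table
counts <- data.table(chrom  = c(1, 1, 1, 1),
                     pause  = c(3025794, 3102057, 3108840, 3133041),
                     strand = c("+", "+", "-", "+"))

# Hypothetical mapping result: only input rows 1, 2 and 4 produced a hit.
# `xHits` imitates the index into the input that the mapping reports.
hits_dt <- data.table(xHits        = c(1L, 2L, 4L),
                      transcriptID = c("ENSMUST00000116652",
                                       "ENSMUST00000116652",
                                       "ENSMUST00000156816"),
                      CDS          = c(196L, 35L, 880L))

# Assign only to the rows that mapped; all other rows are filled with NA
counts[hits_dt$xHits, `:=`(transcriptID = hits_dt$transcriptID,
                           CDS          = hits_dt$CDS)]

print(counts)
```

Because rows without a hit are absent from the mapping result, assigning it directly to all rows recycles a shorter vector, which appears to be what the warnings in the question report.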

Answer 1

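When I need to apply a function to each row of a data.table, I group it by row number:

    counts[, get_feature(.SD), by = 1:nrow(counts)]

As explained in this answer, .I is not meant to be used in by, because it is supposed to return the sequence of row indices produced by the grouping. The reason by = .I does not throw an error is that data.table creates an object .I equal to NULL in the data.table namespace, so by = .I is equivalent to by = NULL.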
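For contrast, here is a minimal sketch of .I used where it is intended, in j. The toy table `DT` and its columns `g` and `x` are made up for illustration.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"),
                 x = c(1, 3, 2, 5))

# Inside j, .I holds the row numbers (in DT) of the current group,
# so this picks the row index of the largest x within each group
idx <- DT[, .I[which.max(x)], by = g]$V1

# Use the collected indices to subset the original table
print(DT[idx])
```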

Note that using by=1:nrow(dt) groups by row number and lets your function access only a single row of the data.table at a time:

    require(data.table)
    counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                         pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
                         strand = sample(c('-','+'), size = 100, replace = TRUE),
                         coverage = sample.int(3, size = 100, replace = TRUE))

    get_feature <- function(dt){
        coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
        rowNum <- nrow(coordinate)
        return(list(text = 'Number of rows in dt', rowNum = rowNum))
    }

    counts[, get_feature(.SD), by = 1:nrow(counts)]

This produces a data.table with the same number of rows as counts, but coordinate contains only a single row of counts on each call:

       nrow                 text rowNum
    1:    1 Number of rows in dt      1
    2:    2 Number of rows in dt      1
    3:    3 Number of rows in dt      1
    4:    4 Number of rows in dt      1
    5:    5 Number of rows in dt      1

whereas by = NULL passes the whole data.table to the function:

    counts[, get_feature(.SD), by = NULL]

                       text rowNum
    1: Number of rows in dt    100

This is how by is expected to work.

Solution 1
4 · accepted · 2017-01-10 08:55:01
