Fast counting of occurrences in an R data.table

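I have a large data.table (about 500 rows and 2.5 million columns). The columns are different features, each of which can take one of four states (here "normal", "medium", "high", NA). I want to count how often each state occurs for each feature. To do that, I wrote a script that works on a smaller data.table. On my full data.table, however, it has been running for 3 days and still has not finished. Any ideas on how to do this faster?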

# example code

library(data.table)

samples <- c("sample_one", "sample_two", "sample_three", "sample_four", "sample_five", "sample_six", "sample_seven", "sample_eight")
feature_one <- c("normal", "medium", "high", NA, "normal", NA, "high", NA)
feature_two <- c("medium", "medium", "medium", "medium", "high", "medium", "normal", NA)
feature_three <- c("normal", "normal", "high", NA, "normal", "medium", "medium", NA)
feature_four <- c("high", "medium", "normal", "medium", "normal", "medium", "high", "normal")
feature_five <- c("normal", "normal", "normal", NA, "normal", "medium", "medium", "medium")

feature_dt <- data.table(samples = samples,
                         feature_one = feature_one,
                         feature_two = feature_two,
                         feature_three = feature_three,
                         feature_four = feature_four,
                         feature_five = feature_five)

cols <- setdiff(names(feature_dt), "samples")

number_of_vars <- length(cols)

na_counts <- vector("list", number_of_vars)
names(na_counts) <- cols

normal_counts <- vector("list", number_of_vars)
names(normal_counts) <- cols

medium_counts <- vector("list", number_of_vars)
names(medium_counts) <- cols

high_counts <- vector("list", number_of_vars)
names(high_counts) <- cols

for (col in cols) {
  eval(parse(text = paste0("na_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][is.na(", col, "), N]")))
  eval(parse(text = paste0("normal_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"normal\", N]")))
  eval(parse(text = paste0("medium_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"medium\", N]")))
  eval(parse(text = paste0("high_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"high\", N]")))
}

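Long data (many rows, few columns) is usually much easier to work with than wide data (many columns, few rows). So I would convert the data to long format and run the aggregation on that: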

feature_dt[, samples := NULL]
dt_melted <- melt.data.table(feature_dt,
                             measure.vars = names(feature_dt),
                             variable.name = "FEATURE",
                             value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]
#>           FEATURE  VALUE N
#>  1:   feature_one   <NA> 3
#>  2:   feature_one   high 2
#>  3:   feature_one medium 1
#>  4:   feature_one normal 2
#>  5:   feature_two   <NA> 1
#>  6:   feature_two   high 1
#>  7:   feature_two medium 5
#>  8:   feature_two normal 1
#>  9: feature_three   <NA> 2
#> 10: feature_three   high 1
#> 11: feature_three medium 2
#> 12: feature_three normal 3
#> 13:  feature_four   high 2
#> 14:  feature_four medium 3
#> 15:  feature_four normal 3
#> 16:  feature_five   <NA> 1
#> 17:  feature_five medium 3
#> 18:  feature_five normal 4
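
Note that feature_dt[, samples := NULL] deletes the column by reference, so feature_dt itself no longer has a samples column afterwards. If that matters, here is a minimal sketch of a variant that keeps samples as an id column instead. The toy table and the names toy_dt, long and counts are illustrative assumptions, not part of the original answer:

```r
library(data.table)

# Toy data (illustrative only; not the asker's table)
toy_dt <- data.table(
  samples     = c("s1", "s2", "s3", "s4"),
  feature_one = c("normal", NA, "high", "normal"),
  feature_two = c("medium", "medium", NA, "high")
)

# Keep "samples" as an id column instead of deleting it by reference;
# all remaining columns become measure variables by default.
long <- melt(toy_dt,
             id.vars = "samples",
             variable.name = "FEATURE",
             value.name = "VALUE")

# Count each (feature, state) pair, NA included
counts <- long[, .N, keyby = .(FEATURE, VALUE)]
counts
```

The aggregation step is the same as above; only the melt call differs, and toy_dt is left unchanged.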

We can use table on each column and bind the results into a single table:

out <- rbindlist(
  lapply(feature_dt[, .SD, .SDcols = -"samples"], \(z) as.data.table(table(state = z, useNA = "always"))),
  idcol = "feature")
out
#           feature  state     N
#            <char> <char> <int>
#  1:   feature_one   high     2
#  2:   feature_one medium     1
#  3:   feature_one normal     2
#  4:   feature_one   <NA>     3
#  5:   feature_two   high     1
#  6:   feature_two medium     5
#  7:   feature_two normal     1
#  8:   feature_two   <NA>     1
#  9: feature_three   high     1
# 10: feature_three medium     2
# 11: feature_three normal     3
# 12: feature_three   <NA>     2
# 13:  feature_four   high     2
# 14:  feature_four medium     3
# 15:  feature_four normal     3
# 16:  feature_four   <NA>     0
# 17:  feature_five medium     3
# 18:  feature_five normal     4
# 19:  feature_five   <NA>     1

If you then want it in a pivoted/reshaped format, we can of course do one of the following, depending on your preference:

dcast(state ~ feature, data = out, value.var = "N", fill = 0L)
#     state feature_five feature_four feature_one feature_three feature_two
#    <char>        <int>        <int>       <int>         <int>       <int>
# 1:   <NA>            1            0           3             2           1
# 2:   high            0            2           2             1           1
# 3: medium            3            3           1             2           5
# 4: normal            4            3           2             3           1

dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
#          feature    NA  high medium normal
#           <char> <int> <int>  <int>  <int>
# 1:  feature_five     1     0      3      4
# 2:  feature_four     0     2      3      3
# 3:   feature_one     3     2      1      2
# 4: feature_three     2     1      2      3
# 5:   feature_two     1     1      5      1

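Note that the column name in the last expression is "NA" (a string), not NA. In any later processing you will therefore need to quote or backtick it, rather than trying to refer to it as the symbol NA.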
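
As a rough sketch of what that looks like in practice: the small table below is a hand-built stand-in for the feature ~ state result, and the names wide, n_missing, na_total and missing are assumptions made for illustration.

```r
library(data.table)

# Toy stand-in for the feature ~ state result (values are illustrative)
wide <- data.table(
  feature   = c("feature_one", "feature_two"),
  n_missing = c(3L, 1L),
  high      = c(2L, 1L)
)
setnames(wide, "n_missing", "NA")  # reproduce the literal column name "NA"

# Access by quoted name
wide[["NA"]]

# Access with backticks inside j
na_total <- wide[, sum(`NA`)]

# Optional: rename the column so it no longer needs quoting
setnames(wide, "NA", "missing")
```

Renaming the column right after the dcast is probably the least error-prone option if the table feeds into further code.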

