fast counting of appearances in R data.table
I have a large data.table (about 500 rows and 2.5 million columns). The columns are different features that can take one of four states ("normal", "medium", "high", NA). I want to count the occurrences of these states for each feature. I wrote a script that works fine on a smaller dt, but on my full data.table it has now been running for 3 days and still hasn't finished. Any ideas how to do this faster?
# example code
library(data.table)
samples <- c("sample_one", "sample_two", "sample_three", "sample_four", "sample_five", "sample_six", "sample_seven", "sample_eight")
feature_one <- c("normal", "medium", "high", NA, "normal", NA, "high", NA)
feature_two <- c("medium", "medium", "medium", "medium", "high", "medium", "normal", NA)
feature_three <- c("normal", "normal", "high", NA, "normal", "medium", "medium", NA)
feature_four <- c("high", "medium", "normal", "medium", "normal", "medium", "high", "normal")
feature_five <- c("normal", "normal", "normal", NA, "normal", "medium", "medium", "medium")
feature_dt <- data.table(samples = samples,
                         feature_one = feature_one,
                         feature_two = feature_two,
                         feature_three = feature_three,
                         feature_four = feature_four,
                         feature_five = feature_five)
cols <- setdiff(names(feature_dt), "samples")
number_of_vars <- length(cols)
na_counts <- vector("list", number_of_vars)
names(na_counts) <- cols
normal_counts <- vector("list", number_of_vars)
names(normal_counts) <- cols
medium_counts <- vector("list", number_of_vars)
names(medium_counts) <- cols
high_counts <- vector("list", number_of_vars)
names(high_counts) <- cols
for (col in cols) {
eval(parse(text = paste0("na_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][is.na(", col, "), N]")))
eval(parse(text = paste0("normal_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"normal\", N]")))
eval(parse(text = paste0("medium_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"medium\", N]")))
eval(parse(text = paste0("high_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"high\", N]")))
}
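As a side note on the loop above: the eval(parse(text = ...)) string-building is not needed, because data.table can group by a column whose name is held in a variable (by = c(col)) and refer to it with get(). A minimal sketch of just the NA branch, on a reduced one-column version of the example data:

```r
library(data.table)

# reduced example: one feature column from the question's data
feature_dt <- data.table(
  feature_one = c("normal", "medium", "high", NA, "normal", NA, "high", NA)
)

cols <- names(feature_dt)
na_counts <- vector("list", length(cols))
names(na_counts) <- cols
for (col in cols) {
  counts <- feature_dt[, .N, by = c(col)]         # `by` accepts a column name as a string
  na_counts[[col]] <- counts[is.na(get(col)), N]  # get() resolves the column by name
}
na_counts$feature_one
# [1] 3
```

This removes the parsing overhead, though the per-column looping itself is still what makes the approach slow on 2.5 million columns.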
Working with long data (many rows, few columns) is usually much easier than working with wide data (many columns, few rows). So I would convert the data to long format and run the aggregation on that:
feature_dt[, samples := NULL]
dt_melted <- melt.data.table(feature_dt,
                             measure.vars = names(feature_dt),
                             variable.name = "FEATURE",
                             value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]
#> FEATURE VALUE N
#> 1: feature_one <NA> 3
#> 2: feature_one high 2
#> 3: feature_one medium 1
#> 4: feature_one normal 2
#> 5: feature_two <NA> 1
#> 6: feature_two high 1
#> 7: feature_two medium 5
#> 8: feature_two normal 1
#> 9: feature_three <NA> 2
#> 10: feature_three high 1
#> 11: feature_three medium 2
#> 12: feature_three normal 3
#> 13: feature_four high 2
#> 14: feature_four medium 3
#> 15: feature_four normal 3
#> 16: feature_five <NA> 1
#> 17: feature_five medium 3
#> 18: feature_five normal 4
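If a wide summary is wanted from this route as well, the counted long table can be reshaped with dcast. A minimal sketch on a two-feature subset of the same melted data (note that the NA values become a column literally named "NA"):

```r
library(data.table)

# small stand-in for the melted data, reduced to two features
dt_melted <- data.table(
  FEATURE = rep(c("feature_one", "feature_two"), each = 4),
  VALUE   = c("normal", "medium", "high", NA, "medium", "medium", "high", NA)
)

counts <- dt_melted[, .N, keyby = .(FEATURE, VALUE)]
wide <- dcast(counts, FEATURE ~ VALUE, value.var = "N", fill = 0L)
# columns: FEATURE, high, medium, normal, NA (NA sorts last)
wide
```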
We can use table on each column and bind the results into one table:
out <- rbindlist(
  lapply(feature_dt[, .SD, .SDcols = -"samples"], \(z) as.data.table(table(state = z, useNA = "always"))),
  idcol = "feature")
out
# feature state N
# <char> <char> <int>
# 1: feature_one high 2
# 2: feature_one medium 1
# 3: feature_one normal 2
# 4: feature_one <NA> 3
# 5: feature_two high 1
# 6: feature_two medium 5
# 7: feature_two normal 1
# 8: feature_two <NA> 1
# 9: feature_three high 1
# 10: feature_three medium 2
# 11: feature_three normal 3
# 12: feature_three <NA> 2
# 13: feature_four high 2
# 14: feature_four medium 3
# 15: feature_four normal 3
# 16: feature_four <NA> 0
# 17: feature_five medium 3
# 18: feature_five normal 4
# 19: feature_five <NA> 1
If you then want this in a pivoted/reshaped format, we can of course do either of the following, depending on your preference:
dcast(state ~ feature, data = out, value.var = "N", fill = 0L)
# state feature_five feature_four feature_one feature_three feature_two
# <char> <int> <int> <int> <int> <int>
# 1: <NA> 1 0 3 2 1
# 2: high 0 2 2 1 1
# 3: medium 3 3 1 2 5
# 4: normal 4 3 2 3 1
dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
# feature NA high medium normal
# <char> <int> <int> <int> <int>
# 1: feature_five 1 0 3 4
# 2: feature_four 0 2 3 3
# 3: feature_one 3 2 1 2
# 4: feature_three 2 1 2 3
# 5: feature_two 1 1 5 1
Note that the column name in the last expression is the string "NA", not the symbol NA; so in subsequent processing you will need to quote or backtick it (rather than trying to refer to it as the symbol NA).
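For instance (a minimal sketch using a small stand-in for the reshaped table, with a column literally named "NA"):

```r
library(data.table)

# stand-in for the dcast result: a column whose name is the string "NA"
wide <- data.table(feature = c("feature_one", "feature_two"),
                   `NA`    = c(3L, 1L),
                   normal  = c(2L, 1L))

wide[, `NA`]   # backticks refer to the column named "NA"
# [1] 3 1
wide[["NA"]]   # or address it as a quoted string via [[ ]]
# [1] 3 1
```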