
fast counting of appearances in R data.table

I have a large data.table (around 500 rows and 2.5 million columns). The columns are different features that can take 4 states (here "normal", "medium", "high", NA ). I want to count the appearances of these states for every feature. For this I wrote a script that basically works on a smaller dt . However, on my full data.table it has been running for 3 days now and is still not finished. Any ideas how to make it faster?

# example code

library(data.table)

samples <- c("sample_one", "sample_two", "sample_three", "sample_four", "sample_five", "sample_six", "sample_seven", "sample_eight")
feature_one <- c("normal", "medium", "high", NA, "normal", NA, "high", NA)
feature_two <- c("medium", "medium", "medium", "medium", "high", "medium", "normal", NA)
feature_three <- c("normal", "normal", "high", NA, "normal", "medium", "medium", NA)
feature_four <- c("high", "medium", "normal", "medium", "normal", "medium", "high", "normal")
feature_five <- c("normal", "normal", "normal", NA, "normal", "medium", "medium", "medium")

feature_dt <- data.table(samples = samples,
                         feature_one = feature_one,
                         feature_two = feature_two,
                         feature_three = feature_three,
                         feature_four = feature_four,
                         feature_five = feature_five)

cols <- setdiff(names(feature_dt), "samples")

number_of_vars <- length(cols)

na_counts <- vector("list", number_of_vars)
names(na_counts) <- cols

normal_counts <- vector("list", number_of_vars)
names(normal_counts) <- cols

medium_counts <- vector("list", number_of_vars)
names(medium_counts) <- cols

high_counts <- vector("list", number_of_vars)
names(high_counts) <- cols

for (col in cols) {
  eval(parse(text = paste0("na_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][is.na(", col, "), N]")))
  eval(parse(text = paste0("normal_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"normal\", N]")))
  eval(parse(text = paste0("medium_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"medium\", N]")))
  eval(parse(text = paste0("high_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"high\", N]")))
}
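(For reference, a minimal sketch of the same per-column tally without eval(parse) : data.table's by accepts a character column name, and get() can look a column up by name. This only tidies the code; it does not by itself address the runtime on 2.5 million columns.)

# sketch without eval(parse): `by = col` takes the column name as a string,
# and get(col) looks that column up by name inside the grouped result
for (col in cols) {
  counts <- feature_dt[, .N, by = col]
  na_counts[[col]]     <- counts[is.na(get(col)), N]
  normal_counts[[col]] <- counts[get(col) == "normal", N]
  medium_counts[[col]] <- counts[get(col) == "medium", N]
  high_counts[[col]]   <- counts[get(col) == "high", N]
}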

It is usually much easier to work on long data (many rows, few columns) than on wide data (many columns, few rows). Therefore, I would convert the data into long format and run the aggregation on that:

feature_dt[, samples := NULL]
dt_melted <- melt.data.table(feature_dt,
                             measure.vars = names(feature_dt),
                             variable.name = "FEATURE",
                             value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]
#>           FEATURE  VALUE N
#>  1:   feature_one   <NA> 3
#>  2:   feature_one   high 2
#>  3:   feature_one medium 1
#>  4:   feature_one normal 2
#>  5:   feature_two   <NA> 1
#>  6:   feature_two   high 1
#>  7:   feature_two medium 5
#>  8:   feature_two normal 1
#>  9: feature_three   <NA> 2
#> 10: feature_three   high 1
#> 11: feature_three medium 2
#> 12: feature_three normal 3
#> 13:  feature_four   high 2
#> 14:  feature_four medium 3
#> 15:  feature_four normal 3
#> 16:  feature_five   <NA> 1
#> 17:  feature_five medium 3
#> 18:  feature_five normal 4
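(A variant sketch, assuming you start from the original feature_dt before samples := NULL : naming samples as id.vars keeps that column intact instead of deleting it by reference.)

# variant sketch: declare samples as the id column; all other columns melt
dt_melted <- melt(feature_dt,
                  id.vars = "samples",
                  variable.name = "FEATURE",
                  value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]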

We can use table on each column and bind the results into a single table:

out <- rbindlist(
  lapply(feature_dt[, .SD, .SDcols = -"samples"], \(z) as.data.table(table(state = z, useNA = "always"))),
  idcol = "feature")
out
#           feature  state     N
#            <char> <char> <int>
#  1:   feature_one   high     2
#  2:   feature_one medium     1
#  3:   feature_one normal     2
#  4:   feature_one   <NA>     3
#  5:   feature_two   high     1
#  6:   feature_two medium     5
#  7:   feature_two normal     1
#  8:   feature_two   <NA>     1
#  9: feature_three   high     1
# 10: feature_three medium     2
# 11: feature_three normal     3
# 12: feature_three   <NA>     2
# 13:  feature_four   high     2
# 14:  feature_four medium     3
# 15:  feature_four normal     3
# 16:  feature_four   <NA>     0
# 17:  feature_five medium     3
# 18:  feature_five normal     4
# 19:  feature_five   <NA>     1

If you then want it in a pivoted/reshaped format, we can do one of the following, depending on your preference:

dcast(state ~ feature, data = out, value.var = "N", fill = 0L)
#     state feature_five feature_four feature_one feature_three feature_two
#    <char>        <int>        <int>       <int>         <int>       <int>
# 1:   <NA>            1            0           3             2           1
# 2:   high            0            2           2             1           1
# 3: medium            3            3           1             2           5
# 4: normal            4            3           2             3           1

dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
#          feature    NA  high medium normal
#           <char> <int> <int>  <int>  <int>
# 1:  feature_five     1     0      3      4
# 2:  feature_four     0     2      3      3
# 3:   feature_one     3     2      1      2
# 4: feature_three     2     1      2      3
# 5:   feature_two     1     1      5      1

Note that the column name in the last expression is "NA" , not NA ; as such, in follow-on processing you'll need to quote or backtick it (instead of trying to refer to it as the symbol NA ).
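(For example, a minimal sketch of referencing that column afterwards; the na_fraction name is just illustrative.)

wide <- dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
# backtick the "NA" column when using it in j; `na_fraction` is a made-up name
wide[, na_fraction := `NA` / (`NA` + high + medium + normal)]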
