Prevalence Estimates from Observations in data.table Containing Many Binary Classification Columns

I am computing prevalence estimates from my raw data.table by brute force, and I need to be more efficient. Can you help?

My data.table contains one weighted observation per row. Many columns act as binary dummy variables indicating whether the observation belongs to one or more of many possible classifications (e.g., a story could be 'amazing', 'boring', or 'charming', or any combination of the three).

There has to be a data.table way to replace my for loop. I also suspect that I might not need to generate the queries set at all. I would appreciate a fresh set of eyes on this problem.

library(data.table)

set.seed(42)
# I have many weighted observations, each of which can belong to one or more of many categories
# in this example, I simulate 42 observations and only 3 categories
dt = data.table(
        weight = runif( n = 42 , min = 0, max = 1 ),
        a = sample( x = c(0,1) , size = 42 , replace = TRUE ),
        b = sample( x = c(0,1) , size = 42 , replace = TRUE ),
        c = sample( x = c(0,1) , size = 42 , replace = TRUE )
)

# Generate all combinations of categories
queries = as.data.table( expand.grid( rep( list(0:1) , length(names(dt))-1 ) ) )
names(queries) = names(dt)[ 2:length(names(dt)) ] # rename Var1, Var2, Var3 to a, b, c

# Brute force through each possible combination to calculate prevalence
prevalence = rep( NA, nrow(queries) )
for( q in 1:nrow(queries) ){
    prevalence[q] = dt[ a == queries[q, a] & b == queries[q, b] & c == queries[q, c] , sum(weight) ] / dt[ , sum(weight) ]
}

results = copy(queries)
results$prevalence = prevalence

results

The output is:

   a b c prevalence
1: 0 0 0 0.10876301
2: 1 0 0 0.18204696
3: 0 1 0 0.03775363
4: 1 1 0 0.25629705
5: 0 0 1 0.02135357
6: 1 0 1 0.15197811
7: 0 1 1 0.12806864
8: 1 1 1 0.11373903

You can calculate it by group:

dt[,.( prevalence = sum(weight) / dt[,sum(weight)] ), by = .(a,b,c)]
  • each group corresponds to one combination of your categories
  • sum the weight within each group, then divide it by the total weight
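With many classification columns, typing every name into by gets tedious. A minimal sketch of the same grouped sum, assuming every column other than weight is a category flag. The table toy and the names cats and total are made up for illustration and are not part of the question's data:

```r
library(data.table)

# Hypothetical hand-made data (not the question's simulated dt)
toy <- data.table(
  weight = c(1, 2, 3, 4),
  a = c(0L, 0L, 1L, 1L),
  b = c(0L, 0L, 1L, 0L)
)

# Assumption: every column except the weight is a category flag
cats <- setdiff(names(toy), "weight")

# Compute the total once instead of once per group
total <- toy[, sum(weight)]

# by/keyby accept a character vector of column names
res <- toy[, .(prevalence = sum(weight) / total), keyby = cats]
print(res)
```

Only the combinations that actually occur in toy appear in res, and their prevalences sum to 1.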

Here are some solutions (in both cases, you can replace the keyby argument with by; keyby additionally sorts the result by the grouping columns).

If your dataset (dt) already contains all possible combinations of the different categories, then you could do the following (as in @Peace Wang's solution):

dt[, .(prevalence = sum(weight)/sum(dt$weight)), keyby=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903

If instead the dataset does not contain all possible combinations of the different categories, you could solve it as follows (CJ(a, b, c, unique=TRUE) computes all combinations and removes duplicates):

dt[CJ(a, b, c, unique=TRUE), .(prevalence = sum(weight)/sum(dt$weight)), keyby=.(a, b, c), on=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903
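A combination that never occurs in the data has no weights to sum, so it is worth deciding explicitly whether it should be reported as 0 or NA. The sketch below is a variation on the join above, not the answer's exact call: it joins onto an explicit grid, groups with by = .EACHI (one result row per row of the grid), and uses na.rm = TRUE so that an absent combination gets prevalence 0. The names toy, combos and total are hypothetical:

```r
library(data.table)

# Hypothetical data in which the combination a = 0, b = 1 never occurs
toy <- data.table(
  weight = c(1, 2, 3, 4),
  a = c(0L, 0L, 1L, 1L),
  b = c(0L, 0L, 1L, 0L)
)

# All 4 combinations of the two flags, in sorted order
combos <- CJ(a = 0:1, b = 0:1)

total <- toy[, sum(weight)]

# by = .EACHI evaluates j once per row of combos;
# unmatched rows see weight = NA, which na.rm = TRUE turns into a sum of 0
res <- toy[combos, on = .(a, b),
           .(prevalence = sum(weight, na.rm = TRUE) / total),
           by = .EACHI]
print(res)
```

Dropping na.rm = TRUE would leave NA for the absent combination instead, which may be preferable if you want to distinguish "not observed" from "zero weight".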
