Categorize a continuous predictor and compute the proportion of a binary outcome

Question

I'm testing for non-linearity in the relationship between several continuous variables and a binary outcome. I would like a fast and efficient way to plot the outcome probability by categorized variable. Here's what I've got, but it seems clunky:

First, the data:

(Edit: a quotation mark was missing.)

df <- structure(list(BMI = c(23, 23, 19, 21, 24, 25, 22, 20, 
20, 18, 18, 22, 23, 22, 20, 21, 20, 23, 26, 18, 20, 25, 28, 21, 
24, 21, 21, 19, 22, 19, 21, 27, 21, 20, 20, 20, 22, 25, 20, 24, 
25, 31, 27, 22, 21, 26, 23, 24, 31, 22, 22, 25, 24, 20, 23, 19, 
20, 24, 20, 22, 23, 21, 20, 22, 21, 22, 21, 25, 20, 31, 23, 22, 
24, 25, 23, 28, 20, 28, 20, 23, 27, 22, 21, 20, 25, 22, 28, 25, 
27, 27, 29, 21, 21, 24, 25, 24, 22, 29, 23, 34, 22, 27, 18, 25, 
23, 26, 23, 23, 21, 22, 29, 26, 23, 23, 21, 21, 24, 20, 21, 23, 
27, 24, 31, 25, 19, 21, 21, 23, 19, 21, 22, 26, 21, 22, 22, 23, 
25, 19, 20, 21, 20, 22, 20, 21, 26, 20, 22, 24, 21, 24, 22, 24, 
28, 22, 24, 25, 30, 20, 24, 29, 23, 24, 24, 22, 20, 21, 22, 25, 
19, 25, 20, 23, 25, 24, 17, 26, 25, 20, 21, 20, 22, 5, 26, 25, 
26, 20, 23, 20, 19, 25, 21, 37, 20, 28, 32, 22, 23, 26, 23, 21, 
24, 20, 22, 19, 24, 22, 22, 25, 24, 26, 25, 21, 21, 22, 27, 27, 
24, 24, 25, 26, 18, 21, 28, 25, 21, 22, 21, 19, 24, 21, 25, 23, 
21, 24, 22, 25, 23, 26, 23, 23, 21, 22, 25, 19, 24, 20, 26, 29, 
19, 22, 24, 30, 28, 24, 31, 22, 27, 25, 23, 23, 26, 23, 25, 23, 
24, 29, 23, 23, 26, 24, 32, 31, 22, 31, 22, 21, 18, 24, 21, 25, 
25, 22, 24, 28, 22, 23, 22, 24, 32, 28, 26, 27, 22, 20, 23, 18, 
20, 20, 19, 30, 28, 27, 29, 23, 20, 20, 25, 28, 22, 24, NA, 27
), Mortality = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 
1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 
1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  
0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)), .Names = c("BMI", "Mortality"), row.names = c(NA, 
-312L), class = "data.frame")

Here's what I've got:

df$BMIcut <- cut(df$BMI,breaks = c(0,17.5,20,22.5,25,30))
df$MortBMIcut <- NULL
for(i in levels(df$BMIcut)){
  df[which(df$BMIcut==i & is.na(df$BMIcut)==F),"MortBMIcut"] <- 
    sum(df[which(df$BMIcut==i & is.na(df$BMIcut)==F &  is.na(df$Mortality)==F),"Mortality"])/
    NROW(df[which(df$BMIcut==i & is.na(df$BMIcut)==F & is.na(df$Mortality)==F),"Mortality"])
}

plot(MortBMIcut ~ BMIcut, data=df)

Which produces:

[image: the resulting plot]

There's got to be a faster way, right?

Answer 1

I don't understand why you need to make so many redundant copies to produce this plot:

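# Row-wise proportions of Mortality within each BMI bin
t <- prop.table(table(df$BMIcut, df$Mortality), 1)
# Column 2 holds the proportion with Mortality == 1
plot(x = factor(levels(df$BMIcut)), y = t[, 2], ylim = c(0, 1))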

Though, as noted in the comments, I get different values than your plot shows. Also, you only have two observations in the (0,17.5] bucket, so I'm not sure whether that is nonlinearity or just data sparsity.
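
For what it's worth, here is a minimal sketch of the same idea that also reports the bin sizes, so sparse buckets are visible next to their proportions. It relies on the fact that the mean of a 0/1 variable is the proportion of 1s. The data frame `toy` and its values are made up for illustration (they are not the question's data), and `summary_tab` is just a name I picked.

```r
# Toy data standing in for the question's df (illustrative values only)
toy <- data.frame(
  BMI       = c(17, 19, 20, 21, 22, 23, 24, 25, 27, 29, 31, NA),
  Mortality = c(1,  0,  1,  0,  0,  1,  0,  0,  1,  1,  0,  0)
)

# Same breaks as in the question; values outside (0, 30] become NA
toy$BMIcut <- cut(toy$BMI, breaks = c(0, 17.5, 20, 22.5, 25, 30))

# The mean of a 0/1 variable is the proportion of 1s in each bin
prop <- tapply(toy$Mortality, toy$BMIcut, mean, na.rm = TRUE)

# Bin sizes, to flag sparse buckets before reading anything into the shape
n <- table(toy$BMIcut)

summary_tab <- data.frame(
  bin  = names(prop),
  n    = as.vector(n),
  prop = as.vector(prop)
)
print(summary_tab)
```

Note that rows with a BMI above the last break (or a missing BMI) fall out of every bin, which is one possible source of proportions that differ from a hand-rolled loop.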
