Using the prob package to calculate a conditional probability in R

My data looks like this:

d

#> # A tibble: 220 x 2
#>    smoker pain 
#>    <chr>  <chr>
#>  1 Smoker Pain 
#>  2 Smoker Pain 
#>  3 Smoker Pain 
#>  4 Smoker Pain 
#>  5 Smoker Pain 
#>  6 Smoker Pain 
#>  7 Smoker Pain 
#>  8 Smoker Pain 
#>  9 Smoker Pain 
#> 10 Smoker Pain 
#> # … with 210 more rows

It is a combination of two variables: smoker and pain.

d %>% 
  count(smoker, pain, sort = TRUE)
#> # A tibble: 4 x 3
#>   smoker    pain        n
#>   <chr>     <chr>   <int>
#> 1 No smoker No pain   107
#> 2 Smoker    Pain       70
#> 3 Smoker    No pain    35
#> 4 No smoker Pain        8

I want to calculate the probability that a person feels pain given that they are a smoker, P(pain | smoker):

library(tidyverse)
library(prob)

d <- probspace(d)
Prob(d, event = smoker == "Smoker", given = pain == "Pain")
#> [1] 0.01282051

As far as I know, this value should be the proportion of smokers who feel pain:

70/105

#> [1] 0.667

What is wrong here?

This is the code for the data:

smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))

d <- tibble(smoker, pain)

I think you should add one more line, d <- cbind(id = seq(nrow(d)),d), after d <- tibble(smoker, pain), i.e.,

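d <- tibble(smoker, pain)
d <- cbind(id = seq(nrow(d)),d)  # unique id so that no two rows are identical
d <- probspace(d)                # as in the question: add the probs column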

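Then, with the pain condition as event and the smoker condition as given (the call in the question has them the other way round), you will get the desired result: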

> Prob(d, event = pain == "Pain", given = smoker == "Smoker")
[1] 0.6666667
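As a cross-check that does not rely on the prob package, the same conditional probability can be computed directly from the rows. This is only a minimal sketch in base R; the names p_direct, tab and p_table are made up for the illustration.

```r
# Rebuild the data from the question (base R only, no tibble needed).
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))
d <- data.frame(smoker, pain)

# P(pain | smoker): share of "Pain" among the rows where smoker == "Smoker".
p_direct <- mean(d$pain[d$smoker == "Smoker"] == "Pain")

# Same quantity from a contingency table with row-wise proportions.
tab <- table(d$smoker, d$pain)
p_table <- prop.table(tab, margin = 1)["Smoker", "Pain"]

print(p_direct)
print(p_table)
```

Both values should agree with the hand calculation 70/105.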

NOTE: The reason for doing this is that Prob() computes the intersect() of the event and the given condition. When a data frame is used as the probability space, duplicate rows in the intersection are dropped. To avoid that, you need to add extra information that distinguishes the rows of the data frame d, so that all duplicates are kept until the end of the calculation.
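The effect described in the note can be imitated in base R. The sketch below is not the prob package's own code: set_intersect and cond_prob are hypothetical helpers written for this illustration, and they assume an equally likely space in which the intersection is taken as a set of rows, so identical rows collapse into one.

```r
# Rebuild the data from the question (base R only).
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))
d <- data.frame(smoker, pain)

# Hypothetical helper: rows of `a` that also occur in `b`, with duplicates collapsed.
set_intersect <- function(a, b) {
  key_a <- do.call(paste, c(a, sep = "\r"))  # one string key per row
  key_b <- do.call(paste, c(b, sep = "\r"))
  unique(a[key_a %in% key_b, , drop = FALSE])
}

# Hypothetical helper: P(event | given) on an equally likely space,
# computed as sum(probs of A intersect B) / sum(probs of B).
cond_prob <- function(space, event, given) {
  space$probs <- 1 / nrow(space)
  A <- space[event, , drop = FALSE]
  B <- space[given, , drop = FALSE]
  sum(set_intersect(A, B)$probs) / sum(B$probs)
}

is_pain <- d$pain == "Pain"
is_smoker <- d$smoker == "Smoker"

# Without an id column the 70 identical Smoker/Pain rows collapse into one row.
without_id <- cond_prob(d, event = is_pain, given = is_smoker)

# With an id column every row is distinct, so all 70 rows are counted.
d_id <- cbind(id = seq_len(nrow(d)), d)
with_id <- cond_prob(d_id, event = is_pain, given = is_smoker)

# The argument order used in the question, still without an id column.
swapped <- cond_prob(d, event = is_smoker, given = is_pain)

print(c(without_id = without_id, with_id = with_id, swapped = swapped))
```

Under these assumptions without_id is 1/105, with_id is 70/105, and swapped is 1/78, which is about 0.0128 and matches the unexpected value reported in the question.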
