
Calculating probability given data from other columns

I have a text file that contains data such as:

ranking index  tornado reports   hail reports   wind reports
0.3968208   9   1   7
0.156263    2   0   3
0.1444246   10  1   7
0.2830781   7   2   6
0.1258707   12  0   2
0.2452705   6   0   6
0.07492937  6   2   8
0.1862151   8   1   5
0.3258324   6   2   17
0.09579834  2   2   10
0.8557362   11  3   14
0.05694438  8   3   9
0.6755703   4   3   24
1.695709    14  0   5
1.242222    17  2   12
0.220234    7   1   7
0.5113825   6   0   6
0.2355718   3   0   12
0.0799512   1   1   6
1.267324    15  2   6
0.0862502   7   1   3
1.151916    33  2   6
0.06002221  9   0   17
0.2011567   11  5   17

I need to find the probability that a wind outbreak is major (ranking index larger than 0.25), given that the number of hail reports is larger than 10, the number of wind reports is larger than 20, and the number of tornado reports is larger than 5.

Assuming this is only part of the complete data, the dplyr solution below uses the conditions hail_reports > 2 & wind_reports > 2 & tornado_reports > 5 (with the original thresholds, no row of this sample would match and the result would be empty). Modify the thresholds appropriately for the complete data.

library(dplyr)

# df is assumed to hold the data, with underscores instead of spaces in the column names
df %>%
  filter(hail_reports > 2 & wind_reports > 2 & tornado_reports > 5) %>%
  mutate(major = if_else(ranking_index > 0.25, 1, 0)) %>%   # major = 1: index > 0.25
  group_by(major) %>%
  summarize(n = n()) %>%
  transmute(major, prob = n / sum(n))

#    major  prob
#    <dbl> <dbl>
#  1     0 0.667
#  2     1 0.333                     # major prob = 0.333

PS: It is always better to avoid spaces in column names, e.g. use "hail_reports" instead of "hail reports".
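For a version that can be run as-is, here is a minimal base R sketch of the same calculation. It assumes the sample rows are pasted in as text and that the column names (ranking_index, tornado_reports, hail_reports, wind_reports) are assigned by hand; the variable names raw, cond, n_match and p_major are illustrative only. The mean of a logical vector is the share of TRUE values, which gives the conditional probability directly.

```r
# Sample rows from the question, without the header line (its names contain spaces)
raw <- "
0.3968208   9   1   7
0.156263    2   0   3
0.1444246   10  1   7
0.2830781   7   2   6
0.1258707   12  0   2
0.2452705   6   0   6
0.07492937  6   2   8
0.1862151   8   1   5
0.3258324   6   2   17
0.09579834  2   2   10
0.8557362   11  3   14
0.05694438  8   3   9
0.6755703   4   3   24
1.695709    14  0   5
1.242222    17  2   12
0.220234    7   1   7
0.5113825   6   0   6
0.2355718   3   0   12
0.0799512   1   1   6
1.267324    15  2   6
0.0862502   7   1   3
1.151916    33  2   6
0.06002221  9   0   17
0.2011567   11  5   17
"

# Assumed column names, chosen to match the dplyr code above
df <- read.table(
  text = raw,
  col.names = c("ranking_index", "tornado_reports", "hail_reports", "wind_reports")
)

# Conditioning event (relaxed thresholds, as in the answer)
cond <- df$hail_reports > 2 & df$wind_reports > 2 & df$tornado_reports > 5

n_match <- sum(cond)                            # rows satisfying the condition
p_major <- mean(df$ranking_index[cond] > 0.25)  # P(major | condition)

print(n_match)
print(p_major)
```

On this sample three rows satisfy the condition and one of them has an index above 0.25, which agrees with the 0.333 reported above. For the full data set, only the thresholds in cond need to change.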

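I think this is an impossible event, because in the given data set the number of hail reports is never larger than 10. Or is the data above only a sample rather than the complete set?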
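This observation can be checked against the sample. The sketch below assumes the tornado, hail and wind columns are typed in by hand as plain vectors (the names tornado, hail, wind, cond and p are illustrative only) and applies the thresholds from the question.

```r
# Columns copied from the sample in the question, in row order
tornado <- c(9, 2, 10, 7, 12, 6, 6, 8, 6, 2, 11, 8, 4, 14, 17, 7, 6, 3, 1, 15, 7, 33, 9, 11)
hail    <- c(1, 0, 1, 2, 0, 0, 2, 1, 2, 2, 3, 3, 3, 0, 2, 1, 0, 0, 1, 2, 1, 2, 0, 5)
wind    <- c(7, 3, 7, 6, 2, 6, 8, 5, 17, 10, 14, 9, 24, 5, 12, 7, 6, 12, 6, 6, 3, 6, 17, 17)

# Largest hail count in the sample
print(max(hail))

# Thresholds from the question
cond <- hail > 10 & wind > 20 & tornado > 5
print(sum(cond))   # number of rows that satisfy the original condition

# Share of "major" rows among the matching rows; with no matching rows,
# the mean of an empty logical vector is NaN
p <- mean(logical(0)[cond])
print(p)
```

The largest hail count in the sample is 5, so no row satisfies the original condition. In that case the conditional probability is undefined rather than zero, which R reports as NaN.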
