
R: Calculating Conditional Probabilities With DPLYR

I am working with the R programming language.

I have the following dataset. It represents students (e.g. id = 1, id = 2, id = 3) who took an exam on different dates, and the result they got (0 = pass, 1 = fail).

library(data.table)

  my_data = data.table( structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), results = c(0, 
0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 
1), date_exam_taken = structure(c(12889, 12943, 15445, 15528, 
17840, 10623, 10680, 11186, 11971, 12826, 13744, 13805, 14904, 
15089, 15815, 16883, 17511, 17673, 11500, 12743, 14906, 15675, 
16774), class = "Date"), exam_number = c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 
3L, 4L, 5L)), row.names = c(NA, 23L), class = "data.frame"))

> head(my_data)
   id results date_exam_taken exam_number
1:  1       0      2005-04-16           1
2:  1       0      2005-06-09           2
3:  1       1      2012-04-15           3
4:  1       1      2012-07-07           4
5:  1       1      2018-11-05           5
6:  2       0      1999-02-01           1

Using the following code in R, I was able to count the number of "3 exam transitions". That is, I was able to count the number of times each student experienced:

  • "pass, pass, pass"
  • "pass, pass, fail"
  • etc.
  • "fail, fail, fail"

The R code looks something like this:

my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)

# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]

out = na.omit(out)

> head(out)
    id current_exam prev_exam prev_2_exam tally
 1:  1            1         0           0     1
 2:  1            1         1           0     1
 3:  1            1         1           1     1
 4:  2            0         1           1     3

Now, I want to calculate the probability of a student passing or failing the current exam, conditional on the results of the previous exam and the exam before that.

I thought the best way to do this was to first perform an aggregation:

library(dplyr)
agg = out %>% group_by(current_exam, prev_exam, prev_2_exam) %>% summarise(total = sum(tally))

> agg
# A tibble: 6 x 4
# Groups:   current_exam, prev_exam [3]
  current_exam prev_exam prev_2_exam total
         <dbl>     <dbl>       <dbl> <int>
1            0         1           0     1
2            0         1           1     4
3            1         0           0     1
4            1         0           1     5
5            1         1           0     4
6            1         1           1     6

From here, I am looking for an efficient way to calculate all conditional probabilities (i.e. P(current_exam = 0 | prev_exam = 0 & prev_2_exam = 0)). These conditional probabilities should be pooled across all students, so that each one represents the conditional probability (of some event happening) for any student within the population.

I figured out how to do this manually for a single example:

# prob(current = 1, given prev = 1, 2nd_prev = 1)
p1 = agg[ agg$current_exam == 1 & agg$prev_exam == 1 & agg$prev_2_exam == 1,]
p2 = agg[ agg$current_exam == 0 & agg$prev_exam == 1 & agg$prev_2_exam == 1,]

final_prob_1_1_1 = sum(p1$total)/(sum(p1$total) + sum(p2$total))

But is there an easier way to do this for all possible combinations? Is there a dplyr function that can "look back", count all combinations up to the second-to-last column, and calculate all the conditional probabilities?

In the end, I am looking to get an output with 8 rows that looks something like this:

 second_prev_prev      current_exam          probs
                11            1              prob1
                11            0              prob2
                10            1              prob3
                10            0              prob4
                01            1              prob5
                01            0              prob6
                00            1              prob7
                00            0              prob8

Thanks!

Note: My attempt. Is this correct?

# my own attempt - I don't think this is correct because in row 5, row 6 - the probabilities sum to a value greater than 1? 
> agg %>%
     group_by(prev_exam, prev_2_exam) %>%
     mutate(probability = total / sum(total))
# A tibble: 6 x 5
# Groups:   prev_exam, prev_2_exam [4]
  current_exam prev_exam prev_2_exam total probability
         <dbl>     <dbl>       <dbl> <int>       <dbl>
1            0         1           0     1         0.2
2            0         1           1     4         0.4
3            1         0           0     1         1  
4            1         0           1     5         1  
5            1         1           0     4         0.8
6            1         1           1     6         0.6

Something like this might work for you:

library(dplyr)

my_data |> 
  arrange(id, exam_number) |> 
  group_by(id) |> 
  mutate(counter = 1:n(),
         results_lag = lag(results, n = 1),
         exams_passed = results + results_lag,
         prob = lag(exams_passed / counter)) 

You want the probability that a student passes (or fails) an exam given their previous two exam results. I first create the lags (exam_1 = previous exam, exam_2 = the one before that) and then aggregate (as you did). Note that the code below counts results == 1 as a pass, whereas the question codes 0 = pass and 1 = fail; under the question's coding, the prob.pass and prob.fail columns swap roles.

group_by(my_data, id) |>
  mutate(exam_1=lag(results, n=1),  
         exam_2=lag(results, n=2)) |>
  filter(!is.na(exam_2)) |>
  group_by(id, exam_2, exam_1) |>
  summarise(passed=sum(results==1),   # the number of times student passed the current exam 
            n=n(), .groups='drop') |> # the number of times these events occurred
  mutate(prob.pass=passed/n,
         prob.fail=1-prob.pass)

# A tibble: 8 × 7
     id exam_2 exam_1 passed     n prob.pass prob.fail
  <int>  <dbl>  <dbl>  <int> <int>     <dbl>     <dbl>
1     1      0      0      1     1      1         0   
2     1      0      1      1     1      1         0   
3     1      1      1      1     1      1         0   
4     2      0      1      3     4      0.75      0.25
5     2      1      0      3     3      1         0   
6     2      1      1      2     4      0.5       0.5 
7     3      1      0      1     1      1         0   
8     3      1      1      1     2      0.5       0.5 

You can verify these results just by looking at the original data. For student 1, there are only 3 possibilities (fail/fail, fail/pass, pass/pass), each occurring once, and for each of these, they passed the current exam. So, the probabilities are all 1. For student 3, there are only 2 possibilities: (pass/fail, n=1) or (pass/pass, n=2) with probabilities 1 and 0.5, respectively. For student 2, there are 3 possibilities (fail/pass, pass/fail, pass/pass) and the probabilities are the number of times they passed the current exam (3, 3, 2) divided by the number of times the events occurred (n = 4, 3, 4), giving probabilities of 0.75, 1, and 0.5.
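The hand check for student 2 can also be sketched in base R. The vector s2 below is copied from the results column for id = 2, and the names s2, w, and tab are chosen here for illustration only. As in the code above, results == 1 is counted as a pass.

```r
# Exam results of student 2, in exam order (copied from my_data)
s2 <- c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1)

# embed() returns one row per window of 3 consecutive exams:
# column 1 is the current exam, column 2 the previous one,
# column 3 the one before that
w <- embed(s2, 3)
colnames(w) <- c("current", "exam_1", "exam_2")

# Cross-tabulate each two-exam history against the current result
tab <- table(history = paste(w[, "exam_2"], w[, "exam_1"]),
             current = w[, "current"])
print(tab)

# Share of windows with current == 1, per history
prob_one <- tab[, "1"] / rowSums(tab)
print(prob_one)
```

The three histories that appear for this student are "0 1", "1 0", and "1 1", matching rows 4 to 6 of the table above.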

All other possibilities didn't occur in your data, so you can assume the probabilities are 0 (or you can say that you don't have enough data to calculate them).
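If you would rather see the unobserved histories as explicit rows, one option is tidyr::complete(). The following is a minimal sketch, assuming the tidyr package is installed. It rebuilds the id and results columns in a data frame called exams (a name chosen here so the block runs on its own) and leaves the probability as NA, rather than 0, when a history never occurred for a student.

```r
library(dplyr)
library(tidyr)

# Same id and results values as my_data, rebuilt compactly (rows are in exam order)
exams <- data.frame(
  id      = rep(1:3, times = c(5, 13, 5)),
  results = c(0, 0, 1, 1, 1,
              0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
              1, 1, 1, 0, 1)
)

per_student <- exams |>
  group_by(id) |>                               # lags stay within one student
  mutate(exam_1 = lag(results, n = 1),
         exam_2 = lag(results, n = 2)) |>
  filter(!is.na(exam_2)) |>
  group_by(id, exam_2, exam_1) |>
  summarise(passed = sum(results == 1),         # same pass coding as the code above
            n = n(), .groups = "drop") |>
  # one row per student and history; zero counts where nothing was observed
  complete(id, exam_2 = c(0, 1), exam_1 = c(0, 1),
           fill = list(passed = 0L, n = 0L)) |>
  # missing, not 0, when the history never occurred
  mutate(prob.pass = if_else(n > 0, passed / n, NA_real_),
         prob.fail = 1 - prob.pass)

print(per_student, n = 12)
```

This gives 12 rows (3 students times 4 histories), and the rows with n = 0 make it visible which probabilities cannot be estimated from the data.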

If you ignore the student, you get the following results:

  exam_2 exam_1 passed     n prob.pass prob.fail
   <dbl>  <dbl>  <int> <int>     <dbl>     <dbl>
1      0      0      1     1     1         0    
2      0      1      4     5     0.8       0.2  
3      1      0      4     4     1         0    
4      1      1      4     7     0.571     0.429

This says that, in a run of three exams, if a student fails the first (exam_2) but passes the second (exam_1), they are 80% likely to pass the third (passed). If they pass the first but fail the second, then they are 100% likely to pass the third. However, and this seems like an example of complacency, if they pass the first two, then they are only 57% likely to pass the third.
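The pooled table can be reproduced by dropping id from the second group_by(). The sketch below does that and, as a possible extension, reshapes the counts into the 8-row long layout the question asks for. It assumes tidyr is installed and rebuilds the data in a data frame called exams (a name chosen here so the block runs on its own). The long table keeps the raw 0/1 values of current_exam, so it does not depend on which value is read as a pass.

```r
library(dplyr)
library(tidyr)

# Same id and results values as my_data, rebuilt compactly (rows are in exam order)
exams <- data.frame(
  id      = rep(1:3, times = c(5, 13, 5)),
  results = c(0, 0, 1, 1, 1,
              0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
              1, 1, 1, 0, 1)
)

# Lags are computed within each student so a history never spans two students
lagged <- exams |>
  group_by(id) |>
  mutate(current_exam = results,
         prev_exam    = lag(results, n = 1),
         prev_2_exam  = lag(results, n = 2)) |>
  ungroup() |>
  filter(!is.na(prev_2_exam))

# Wide form: one row per history, pooled over students
pooled <- lagged |>
  group_by(exam_2 = prev_2_exam, exam_1 = prev_exam) |>
  summarise(passed = sum(results == 1), n = n(), .groups = "drop") |>
  mutate(prob.pass = passed / n,
         prob.fail = 1 - prob.pass)

# Long form: one row per history and current result (8 rows)
long <- lagged |>
  count(prev_2_exam, prev_exam, current_exam, name = "total") |>
  complete(prev_2_exam = c(0, 1), prev_exam = c(0, 1), current_exam = c(0, 1),
           fill = list(total = 0L)) |>
  group_by(prev_2_exam, prev_exam) |>           # condition on the history only
  mutate(probs = total / sum(total)) |>
  ungroup()

print(pooled)
print(long)
```

Because the division happens within each (prev_2_exam, prev_exam) group, the two probabilities for a given history sum to 1. The same holds for the group_by() and mutate() attempt in the question: its rows 5 and 6 belong to different histories, so they are not expected to sum to 1 with each other. A history that never occurs in the pooled data would give 0/0 = NaN here; that does not happen with this dataset.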


 