Calculating the probability that a student fails two consecutive exams?

I am working with the R programming language. I have the following dataset: students take an exam multiple times, and each attempt is either a pass ("1") or a fail ("0"). The data looks something like this:

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

# number each student's exams in date order
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
7992   1       1      2004-04-23           1
24837  1       0      2004-12-10           2
12331  1       1      2007-01-19           3
34396  1       0      2007-02-21           4
85250  1       0      2007-09-26           5
11254  1       1      2009-12-20           6

I am interested in the following: suppose a student fails an exam, what is the probability that this student fails the next exam? (e.g. the student fails the 1st exam: what is the probability of failing the 2nd exam? The student fails the 5th exam: what is the probability of failing the 6th exam?) I wrote the following loop to answer this question:

my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        start_i = my_data[my_data$id == i,]

        pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i = as.data.frame(table(pairs_i))
        frame_i$id = i
        print(frame_i)
        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)

library(dplyr)
total_1 = final %>% group_by(first, second) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = "first", all = TRUE)
join$probs = join$totals/join$sum

The final answer looks something like this:

  first second totals   sum     probs
1     0      0   9817 19557 0.5019686
2     0      1   9740 19557 0.4980314
3     1      0   9727 19498 0.4988717
4     1      1   9771 19498 0.5011283
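
For comparison, here is a minimal base-R sketch that builds the same kind of first/second table without a loop. It counts every consecutive pair of exams, whereas `n()` in the code above counts rows of `final` (one row per student and combination). The `toy` data frame and its values are hypothetical and are not taken from the simulation in the question.

```r
# Hypothetical toy data: two students, exams already numbered in date order
toy <- data.frame(
  id          = c(1, 1, 1, 1, 2, 2, 2),
  exam_number = c(1, 2, 3, 4, 1, 2, 3),
  results     = c(0, 0, 1, 0, 1, 0, 0)
)
toy <- toy[order(toy$id, toy$exam_number), ]

# Result of the previous exam within each student (NA for a student's first exam)
toy$first <- ave(toy$results, toy$id, FUN = function(x) c(NA, head(x, -1)))
pairs <- toy[!is.na(toy$first), ]

# Count every consecutive pair, then normalise within each value of `first`
counts <- table(first = pairs$first, second = pairs$results)
probs  <- prop.table(counts, margin = 1)
print(counts)
print(probs)
```

Each row of `probs` sums to 1, so `probs["0", "0"]` is the estimated probability of failing an exam given that the previous one was failed.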

Now, I am trying to modify the above code so that the analysis is performed at the "Second Level". That is, I want to find the probability of a student failing the next exam given that the student failed the two previous exams. As an example, suppose the student failed the 3rd and the 4th exam: what is the probability that the student will fail the 5th exam?

I think that the final answer should look something like this:

# note: "first" and "second" refer to any consecutive exams (not the literal first and second exam), and "third" refers to the exam occurring immediately after the "second" exam

  first second third totals sums probs
1     1      1     1    ...  ...   ...
2     0      0     0    ...  ...   ...
3     1      0     1    ...  ...   ...
4     0      1     0    ...  ...   ...
5     0      0     1    ...  ...   ...
6     1      1     0    ...  ...   ...
7     0      1     1    ...  ...   ...
8     1      0     0    ...  ...   ...

I tried to manually modify my code for these requirements:

library(stringr)
my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        start_i = my_data[my_data$id == i,]

        vals_i = as.numeric(paste(start_i$results, collapse = ""))
        L_1_i = lengths(gregexpr("111", vals_i))
        L_2_i = lengths(gregexpr("000", vals_i))
        L_3_i = lengths(gregexpr("101", vals_i))
        L_4_i = lengths(gregexpr("010", vals_i))
        L_5_i = lengths(gregexpr("001", vals_i))
        L_6_i = lengths(gregexpr("110", vals_i))
        L_7_i = lengths(gregexpr("011", vals_i))
        L_8_i = lengths(gregexpr("100", vals_i))

        frame_i = data.frame(class = c("111","000","101","010","001","110","011","100"), values = c(L_1_i, L_2_i, L_3_i, L_4_i, L_5_i, L_6_i, L_7_i, L_8_i))
        frame_i$id = i
        print(frame_i)

        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)


final$first = substr(final$class, 1,1)
final$second = substr(final$class, 2,2)
final$third = substr(final$class, 3,3)


total_1 = final %>% group_by(first, second, third) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first, second) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = c("first", "second"), all = TRUE)
join$probs = join$totals/join$sum

As an example, to answer the original question we would now query the results (assuming I did this correctly). Here is the probability of a student passing an exam given that the student failed the previous two exams:

 first second third totals   sum probs
1     0      0     0  10000 20000   0.5
2     0      0     1  10000 20000   0.5
3     0      1     0  10000 20000   0.5
4     0      1     1  10000 20000   0.5
5     1      0     0  10000 20000   0.5
6     1      0     1  10000 20000   0.5
7     1      1     0  10000 20000   0.5
8     1      1     1  10000 20000   0.5

join[join$first == 0 & join$second == 0 & join$third == 1, ]

I am not sure if what I have done is correct. Can someone please help me correct my code (and possibly show me a "better" way to do this)?

Note: These probabilities look too "uniform" and make me inclined to think I did something wrong...
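
A likely reason for the identical totals: `final` contains exactly one row per student for each of the eight classes, so `summarise(totals = n())` counts students rather than occurrences, and the `values` column is never used. Separately, `as.numeric()` drops leading zeros from the pasted string, `lengths(gregexpr(...))` returns 1 even when there is no match, and `gregexpr` does not count overlapping matches. The sketch below counts overlapping triples directly instead. It uses hypothetical toy data, and `triples_for` is a helper name made up for this illustration.

```r
# Hypothetical toy data: results already sorted by student and exam date
toy <- data.frame(
  id      = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  results = c(0, 0, 0, 1, 0, 1, 0, 0, 1)
)

# All overlapping windows of three consecutive results for one student
triples_for <- function(x) {
  n <- length(x)
  if (n < 3) return(NULL)  # fewer than three exams: no triple to count
  data.frame(first = x[1:(n - 2)], second = x[2:(n - 1)], third = x[3:n])
}

triples <- do.call(rbind, lapply(split(toy$results, toy$id), triples_for))

# Number of occurrences of each observed triple
counts <- aggregate(totals ~ first + second + third,
                    data = cbind(triples, totals = 1), FUN = sum)

# Total for each (first, second) prefix, then the conditional probability
counts$sum   <- ave(counts$totals, counts$first, counts$second, FUN = sum)
counts$probs <- counts$totals / counts$sum
print(counts[order(counts$first, counts$second, counts$third), ])
```

Only triples that actually occur appear in `counts`. The row with first = 0, second = 0, third = 0 gives the estimated probability of failing an exam after failing the previous two.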

P(A|B) = P(B|A)*P(A)/P(B)

where A is the event of failing exam n and B is the event of failing both exams n-1 and n-2.

We can write a function to calculate P(A|B):

library(dplyr)
library(tidyr)  # pivot_wider() lives in tidyr, not dplyr

# one row per student, one column per exam number (named "1", "2", ...)
dd <- my_data %>%
  pivot_wider(id_cols = id, names_from = exam_number, values_from = results)

p_fail_given_failfail <- function(n){ # n is an integer > 2
  # keep only students who took exam n (and therefore exams n-1 and n-2)
  e <- dd[!is.na(dd[[as.character(n)]]), ]

  A <- e[[as.character(n)]] == 0                                      # failed exam n
  B <- e[[as.character(n - 1)]] == 0 & e[[as.character(n - 2)]] == 0  # failed exams n-1 and n-2

  # Calculate probabilities
  p_BA <- sum(A & B) / sum(A)  # P(B|A)
  p_A  <- mean(A)              # P(A)
  p_B  <- mean(B)              # P(B)

  p_AB <- p_BA * p_A / p_B     # P(A|B) by Bayes' rule
  return(p_AB)
}

p_fail_given_failfail(3) #prob of failing exam3, given failed exam2 and exam1
#0.5084479
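
Since P(B|A)*P(A) = P(A and B), the Bayes expression reduces to a direct ratio of counts: the number of students who failed all three exams divided by the number who failed exams n-1 and n-2. A minimal sketch checking this on a hypothetical 6-student matrix (`wide` and its values are invented for illustration):

```r
# Hypothetical results for 6 students on exams 1..3 (0 = fail, 1 = pass)
wide <- matrix(c(0, 0, 0,
                 0, 0, 1,
                 0, 0, 0,
                 1, 0, 0,
                 0, 1, 0,
                 1, 1, 1),
               ncol = 3, byrow = TRUE)
n <- 3

A <- wide[, n] == 0                         # failed exam n
B <- wide[, n - 1] == 0 & wide[, n - 2] == 0  # failed exams n-1 and n-2

bayes  <- (sum(A & B) / sum(A)) * mean(A) / mean(B)  # P(B|A) * P(A) / P(B)
direct <- sum(A & B) / sum(B)                        # P(A and B) / P(B)
print(c(bayes = bayes, direct = direct))
```

Both values agree, so the direct count ratio is a simpler way to get the same estimate.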

This approach generalizes the question to: "if results follow a coin-flip (50/50) distribution, what is the probability that a student fails test n + 1 after failing the previous n tests?"

## Define parameters
students = 10000L
tests = 3L

## simulate tests. 
##  Generates students x tests matrix of test results.
test_results = replicate(tests,
                         sample(c(TRUE,FALSE),
                                students,
                                replace = TRUE))

## determine student set that were failures up to test - 1
complete_failures = rowSums(test_results[, seq_len(tests-1), drop = FALSE]) == 0L
final_test = test_results[complete_failures, tests]

## summarize results
prop.table(table(final_test))
#> final_test
#>     FALSE      TRUE 
#> 0.5150162 0.4849838

OP may be more interested in the binomial distribution. That is, with n trials, what is the probability that none is successful (every exam is failed)? In that case, we can use pbinom, since pbinom(0.5, n, 0.5) evaluates P(X <= 0.5) = P(X = 0):

pbinom(0.5, 1, 0.5)
#> [1] 0.5
pbinom(0.5, 2, 0.5)
#> [1] 0.25
pbinom(0.5, 3, 0.5)
#> [1] 0.125
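
The same numbers can be written more explicitly with `dbinom`, which gives the probability of exactly zero passes in n exams. A small sketch, assuming independent exams with pass probability 0.5 as in the simulated data:

```r
n <- 1:3

# Probability of exactly 0 passes in n independent exams with pass probability 0.5
all_fail <- dbinom(0, size = n, prob = 0.5)
print(all_fail)
#> [1] 0.500 0.250 0.125

# Same values as the cumulative form used above, and as 0.5^n
stopifnot(isTRUE(all.equal(all_fail, pbinom(0.5, n, 0.5))))
stopifnot(isTRUE(all.equal(all_fail, 0.5^n)))
```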

Agree with @cole, but it may be negative binomial (rnbinom), since n is the number of exams before the first pass (r = 1), with each exam being an iid Bernoulli trial.
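
To illustrate the comment above: with r = 1 the negative binomial is the geometric distribution, so the probability of k failures before the first pass can be read from either `dnbinom` or `dgeom`. A minimal sketch, assuming a pass probability of 0.5 per exam (an assumption that matches the simulated data, not real exam results):

```r
k <- 0:3  # number of failed exams before the first pass

p_nbinom <- dnbinom(k, size = 1, prob = 0.5)  # negative binomial with r = 1
p_geom   <- dgeom(k, prob = 0.5)              # geometric distribution
print(rbind(k = k, p_nbinom = p_nbinom, p_geom = p_geom))

# Both equal 0.5^(k + 1): k failures followed by one pass
stopifnot(isTRUE(all.equal(p_nbinom, p_geom)))
stopifnot(isTRUE(all.equal(p_geom, 0.5^(k + 1))))
```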
