简体   繁体   English

R:纵向数据随机采样

[英]R: Random Sampling of Longitudinal Data

I have the following dataset in R (eg the same students take an exam each year and their results are recorded):我在 R 中有以下数据集(例如,相同的学生每年参加考试并记录他们的成绩):

student_id = c(1,1,1,1,1, 2,2,2, 3,3,3,3)
exam_number = c(1,2,3,4,5,1,2,3,1,2,3,4)
exam_result = rnorm(12, 80,10)
my_data = data.frame(student_id, exam_number, exam_result)

  student_id exam_number exam_result
1          1           1    72.79595
2          1           2    81.12950
3          1           3    93.29906
4          1           4    79.33229
5          1           5    76.64106
6          2           1    95.14271

Suppose I take a random sample from this data:假设我从这些数据中随机抽样:

library(dplyr)
random_sample = sample_n(my_data, 5, replace = TRUE)

  student_id exam_number exam_result
1          3           1    76.19691
2          3           3    87.52431
3          2           2    91.89661
4          2           3    80.05088
5          2           2    91.89661

Now, I can take the highest "exam_number" per student from this random sample:现在,我可以从这个随机样本中获取每个学生的最高“exam_number”:

max_value = random_sample %>%
    group_by(student_id) %>%
    summarize(max = max(exam_number))

# A tibble: 2 x 2
  student_id   max
       <dbl> <dbl>
1          2     3
2          3     3

Based on these results - I want to accomplish the following.基于这些结果 - 我想完成以下任务。 For the students that were selected in "random_sample" :对于在“random_sample”中选择的学生

  • Create a dataset that contains all rows occurring AFTER the "max exam number" (eg call this dataset "data_after")创建一个数据集,其中包含在“最大考试编号”之后出现的所有行(例如,将此数据集称为“data_after”)
  • Create a dataset that contains all rows occurring BEFORE (and equal to) the "max exam number" (eg call this dataset "data_before")创建一个数据集,其中包含在“最大考试编号”之前(和等于)出现的所有行(例如,将此数据集称为“data_before”)

In the example I have created, this would look something like this:在我创建的示例中,这看起来像这样:

# after
  student_id exam_number exam_result
1          3           4    105.5805

# before

  student_id exam_number exam_result
1          2           1    95.14000
2          2           2    91.89000
3          2           3    80.05000
4          3           1    76.19691
5          3           2   102.00875
6          3           3    87.52431

Currently, I am trying to do this in a very indirect way using JOINS and ANTI_JOINS:目前,我正在尝试使用 JOINS 和 ANTI_JOINS 以非常间接的方式执行此操作:

    max_3 = as.numeric(max_value[2,2])
    max_s3 =  max_3 - 1
    student_3 = seq(1, max_s3 , by = 1)

    before_student_3 =  my_data[is.element(my_data$exam_number, student_3) & my_data$student_id == 3,]

remainder_student_3 = my_data[my_data$student_id == 3,]
after_student_3 =  anti_join(remainder_student_3, before_student_3)

But I don't think I am doing this correctly - can someone please show me how to do this?但我不认为我这样做是正确的 - 有人可以告诉我如何做到这一点吗?

Thanks!谢谢!

The code above also uses a join, like it is said in the question.上面的代码也使用了连接,就像问题中所说的那样。 Then, the wanted data sets are created by filter ing the join result.然后,通过对连接结果进行filter来创建所需的数据集。

student_id = c(1,1,1,1,1, 2,2,2, 3,3,3,3)
exam_number = c(1,2,3,4,5,1,2,3,1,2,3,4)
exam_result = rnorm(12, 80,10)
my_data = data.frame(student_id, exam_number, exam_result)

suppressPackageStartupMessages({
  library(dplyr)
})

set.seed(2022)
(random_sample = sample_n(my_data, 5, replace = TRUE))
#>   student_id exam_number exam_result
#> 1          1           4    73.97148
#> 2          1           3    84.77151
#> 3          2           2    78.76927
#> 4          3           3    69.35063
#> 5          1           4    73.97148


max_value = random_sample %>%
  group_by(student_id) %>%
  summarize(max = max(exam_number))

# join only once
max_value %>%
  left_join(my_data, by = "student_id") -> join_data
join_data
#> # A tibble: 12 × 4
#>    student_id   max exam_number exam_result
#>         <dbl> <dbl>       <dbl>       <dbl>
#>  1          1     4           1        71.0
#>  2          1     4           2        69.1
#>  3          1     4           3        84.8
#>  4          1     4           4        74.0
#>  5          1     4           5        80.7
#>  6          2     2           1        77.4
#>  7          2     2           2        78.8
#>  8          2     2           3        69.5
#>  9          3     3           1        83.9
#> 10          3     3           2        62.7
#> 11          3     3           3        69.4
#> 12          3     3           4       102.

data_before <- join_data %>%
  group_by(student_id) %>%
  filter(exam_number <= max) %>%
  ungroup() %>%
  select(-max)

data_after <- join_data %>%
  group_by(student_id) %>%
  filter(exam_number > max) %>%
  ungroup() %>%
  select(-max)

data_before
#> # A tibble: 9 × 3
#>   student_id exam_number exam_result
#>        <dbl>       <dbl>       <dbl>
#> 1          1           1        71.0
#> 2          1           2        69.1
#> 3          1           3        84.8
#> 4          1           4        74.0
#> 5          2           1        77.4
#> 6          2           2        78.8
#> 7          3           1        83.9
#> 8          3           2        62.7
#> 9          3           3        69.4

data_after
#> # A tibble: 3 × 3
#>   student_id exam_number exam_result
#>        <dbl>       <dbl>       <dbl>
#> 1          1           5        80.7
#> 2          2           3        69.5
#> 3          3           4       102.

# final clean-up
rm(join_data)

Created on 2022-12-10 with reprex v2.0.2创建于 2022-12-10,使用reprex v2.0.2

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM