
Joining baseline and time-varying data in R

I am trying to merge two datasets in R. One contains baseline data for a cohort, and the other contains updated time-varying data for the same people over time. I need to merge the two into a long-form dataset with one row per person per year, while keeping the non-time-varying variables (like sex or race, which don't get updated) the same in every row for that person.

For example, with the datasets below, I would want 10 rows per ID number, with marital_status and employment updated in each row but sex remaining fixed for each ID number. This seems like it should be relatively simple, but I can't find a way to merge them without leaving sex as NA in the years past baseline.

baseline <- data.frame(
  ID = c(1:10),
  year = 2000,
  marital_status = (sample(0:1, 10, replace = TRUE)),
  employment = (sample(0:1, 10, replace = TRUE)),
  sex = (sample(c("M","F"), 10, replace = TRUE))
)

head(baseline)

time_varying <- data.frame(
  ID = c(1:10),
  year = rep(2001:2010, 10),
  marital_status = (sample(0:1, 100, replace = TRUE)),
  employment = (sample(0:1, 100, replace = TRUE))
)

head(time_varying)
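To make the problem concrete, here is a minimal sketch of where the NAs come from. The toy data frames base_toy and tv_toy are hypothetical stand-ins, not the data above. bind_rows() fills any column that is missing from one of its inputs with NA, so the rows that come from the time-varying table end up with no sex:

```r
library(dplyr)

# Hypothetical toy data: sex is recorded only at baseline.
base_toy <- data.frame(ID = 1:2, year = 2000, sex = c("F", "M"))
tv_toy <- data.frame(
  ID = rep(1:2, each = 2),
  year = rep(2001:2002, times = 2)
)

# tv_toy has no sex column, so its rows get NA for sex after stacking.
stacked <- bind_rows(base_toy, tv_toy)
stacked$sex
```

Only the two baseline rows carry a value for sex; the four later rows are NA, which is the gap the answers below fill in.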

I think tidyr::fill() is what you're looking for. By default it fills each NA with the last non-NA value above it, and on a grouped data frame it does so within each group.

Example:

library(tidyverse)

baseline <- data.frame(
  ID = c(1:10),
  year = 2000,
  marital_status = (sample(0:1, 10, replace = TRUE)),
  employment = (sample(0:1, 10, replace = TRUE)),
  sex = (sample(c("M","F"), 10, replace = TRUE))
)


time_varying <- data.frame(
  ID = c(1:10),
  year = rep(2001:2010, 10),
  marital_status = (sample(0:1, 100, replace = TRUE)),
  employment = (sample(0:1, 100, replace = TRUE))
)

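baseline %>%
  bind_rows(time_varying) %>%  # stack both tables; sex is NA in the time_varying rows
  group_by(ID) %>%             # fill within each person, not across people
  arrange(year) %>%            # baseline year (2000) sorts first
  fill(sex)                    # carry sex down from the baseline row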
#> # A tibble: 110 × 5
#> # Groups:   ID [10]
#>       ID  year marital_status employment sex  
#>    <int> <dbl>          <int>      <int> <chr>
#>  1     1  2000              1          0 F    
#>  2     2  2000              0          0 M    
#>  3     3  2000              0          1 F    
#>  4     4  2000              0          0 M    
#>  5     5  2000              0          1 F    
#>  6     6  2000              1          0 F    
#>  7     7  2000              1          0 F    
#>  8     8  2000              1          0 F    
#>  9     9  2000              1          0 M    
#> 10    10  2000              0          1 F    
#> # … with 100 more rows

Created on 2022-07-29 by the reprex package (v2.0.1)
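Because the sampled values above change on every run, a small deterministic sketch may make the fill step easier to follow. The toy tibble below is made up for illustration, and I sort with .by_group = TRUE here so that each person's rows stay together (a choice for readability, not something the answer above requires):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two people, sex recorded only in the baseline year.
toy <- tibble(
  ID   = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  sex  = c("F", NA, NA, "M", NA, NA)
)

filled <- toy %>%
  group_by(ID) %>%                       # fill within each person only
  arrange(year, .by_group = TRUE) %>%    # baseline row first within each ID
  fill(sex, .direction = "down") %>%     # "down" is also the default
  ungroup()

filled$sex
```

Every row for ID 1 ends up with "F" and every row for ID 2 with "M". Grouping matters: without group_by(ID), a person whose baseline value was missing would inherit the previous person's value.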

EDIT: If you only want rows for the years that appear in time_varying (that is, no separate baseline-year row for each ID), then I'd do:

baseline %>%
  # add any other time-invariant cols in this next line
  select(ID, sex) %>% 
  left_join(time_varying, by = "ID")
#>     ID sex year marital_status employment
#> 1    1   M 2001              0          0
#> 2    1   M 2001              1          0
#> 3    1   M 2001              0          0
#> 4    1   M 2001              1          0
#> 5    1   M 2001              1          1
#> 6    1   M 2001              0          0
#> 7    1   M 2001              1          1
#> 8    1   M 2001              0          1
#> 9    1   M 2001              1          1
#> 10   1   M 2001              0          1
#> 11   2   F 2002              1          0
#> 12   2   F 2002              0          0
#> 13   2   F 2002              1          0
#> 14   2   F 2002              1          1
#> 15   2   F 2002              1          0
#> 16   2   F 2002              1          1
#> 17   2   F 2002              1          1
#> 18   2   F 2002              0          1
#> 19   2   F 2002              1          1
#> 20   2   F 2002              1          1
#> 21   3   M 2003              0          0
#> 22   3   M 2003              1          0
#> 23   3   M 2003              1          0
#> 24   3   M 2003              1          0
#> 25   3   M 2003              1          1
#> 26   3   M 2003              1          0
#> 27   3   M 2003              1          1
#> 28   3   M 2003              1          0
#> 29   3   M 2003              0          1
#> 30   3   M 2003              1          0
#> 31   4   F 2004              1          1
#> 32   4   F 2004              1          0
#> 33   4   F 2004              1          0
#> 34   4   F 2004              0          1
#> 35   4   F 2004              1          0
#> 36   4   F 2004              0          0
#> 37   4   F 2004              1          0
#> 38   4   F 2004              1          1
#> 39   4   F 2004              1          1
#> 40   4   F 2004              1          1
#> 41   5   F 2005              0          0
#> 42   5   F 2005              1          0
#> 43   5   F 2005              1          1
#> 44   5   F 2005              1          1
#> 45   5   F 2005              1          0
#> 46   5   F 2005              1          1
#> 47   5   F 2005              0          1
#> 48   5   F 2005              0          1
#> 49   5   F 2005              1          1
#> 50   5   F 2005              0          0
#> 51   6   F 2006              0          1
#> 52   6   F 2006              1          0
#> 53   6   F 2006              0          0
#> 54   6   F 2006              1          1
#> 55   6   F 2006              0          1
#> 56   6   F 2006              1          1
#> 57   6   F 2006              0          1
#> 58   6   F 2006              1          1
#> 59   6   F 2006              0          1
#> 60   6   F 2006              0          1
#> 61   7   F 2007              1          0
#> 62   7   F 2007              0          1
#> 63   7   F 2007              1          0
#> 64   7   F 2007              1          0
#> 65   7   F 2007              1          1
#> 66   7   F 2007              0          0
#> 67   7   F 2007              0          1
#> 68   7   F 2007              1          0
#> 69   7   F 2007              1          0
#> 70   7   F 2007              1          0
#> 71   8   M 2008              1          1
#> 72   8   M 2008              0          0
#> 73   8   M 2008              0          1
#> 74   8   M 2008              1          0
#> 75   8   M 2008              1          0
#> 76   8   M 2008              1          0
#> 77   8   M 2008              1          1
#> 78   8   M 2008              0          1
#> 79   8   M 2008              1          0
#> 80   8   M 2008              1          0
#> 81   9   F 2009              1          0
#> 82   9   F 2009              0          1
#> 83   9   F 2009              1          1
#> 84   9   F 2009              1          0
#> 85   9   F 2009              1          0
#> 86   9   F 2009              1          1
#> 87   9   F 2009              1          1
#> 88   9   F 2009              0          1
#> 89   9   F 2009              0          0
#> 90   9   F 2009              0          1
#> 91  10   M 2010              0          0
#> 92  10   M 2010              1          0
#> 93  10   M 2010              0          0
#> 94  10   M 2010              0          1
#> 95  10   M 2010              0          0
#> 96  10   M 2010              1          1
#> 97  10   M 2010              0          0
#> 98  10   M 2010              1          1
#> 99  10   M 2010              0          0
#> 100 10   M 2010              0          0

Created on 2022-07-29 by the reprex package (v2.0.1)
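One thing worth noticing in the output above: every ID is paired with a single year (ID 1 only with 2001, ID 2 only with 2002, and so on). That comes from the example data, where ID = c(1:10) is recycled in step with rep(2001:2010, 10). If the intent was a full panel with every ID observed in every year, a minimal sketch of one way to build it follows. The rep(..., each = 10) construction and the set.seed() call are my assumptions, not part of the original post:

```r
library(dplyr)

set.seed(1)  # assumption: added only so this sketch is reproducible

baseline <- data.frame(
  ID = 1:10,
  year = 2000,
  marital_status = sample(0:1, 10, replace = TRUE),
  employment = sample(0:1, 10, replace = TRUE),
  sex = sample(c("M", "F"), 10, replace = TRUE)
)

time_varying <- data.frame(
  ID = rep(1:10, each = 10),           # each ID occupies 10 consecutive rows
  year = rep(2001:2010, times = 10),   # years cycle 2001..2010 within each ID
  marital_status = sample(0:1, 100, replace = TRUE),
  employment = sample(0:1, 100, replace = TRUE)
)

# Attach the time-invariant columns to every follow-up row.
panel <- baseline %>%
  select(ID, sex) %>%
  left_join(time_varying, by = "ID")

# Each ID should now have one row per year from 2001 to 2010.
count(panel, ID)
```

With the data built this way, the same left_join() gives 10 rows per ID with distinct years, and sex is constant within each ID.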

Given the data you have, you can bind the rows from time_varying to the rows from baseline (without sex), then (inner or outer) join the baseline non-time-varying columns (ID, sex) to those bound rows:


library(dplyr)

inner_join(
  select(baseline, ID, sex),                         # time-invariant columns
  bind_rows(select(baseline, -sex), time_varying),   # all yearly rows, baseline included
  by = "ID"                                          # join key made explicit
)

(Note: as @thelatemail has commented, if the baseline year is 2000 and the time_varying data cover 2001 to 2010, then the long data will have 11 rows per ID.)
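A quick way to check the row count mentioned in the note is to count rows per ID after the join. This is only a sketch: it rebuilds data with the same structure as in the question (the values are random), and the names long and rows_per_id are mine:

```r
library(dplyr)

# Same structure as the data in the question; values are random.
baseline <- data.frame(
  ID = c(1:10),
  year = 2000,
  marital_status = sample(0:1, 10, replace = TRUE),
  employment = sample(0:1, 10, replace = TRUE),
  sex = sample(c("M", "F"), 10, replace = TRUE)
)

time_varying <- data.frame(
  ID = c(1:10),
  year = rep(2001:2010, 10),
  marital_status = sample(0:1, 100, replace = TRUE),
  employment = sample(0:1, 100, replace = TRUE)
)

long <- inner_join(
  select(baseline, ID, sex),
  bind_rows(select(baseline, -sex), time_varying),
  by = "ID"
)

# One baseline row plus ten follow-up rows per person.
rows_per_id <- count(long, ID)
rows_per_id
```

Each ID shows n = 11, and sex has no missing values because it is joined on from baseline rather than stacked.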
