Joining baseline and time-varying data in R
I am trying to merge two datasets in R. One of them contains baseline data for a cohort, and the other contains updated time-varying data for those same people over time. I need to merge the two into a long-form dataset with one row per person per year, while keeping the non-time-varying variables (like sex or race, which don't get updated) the same in every row.
For example, with the datasets below, I would want 10 rows per ID number, with marital_status and employment updated in each row but sex remaining fixed for each ID number. This seems like it should be relatively simple, but I can't find a way to merge them without leaving sex as NA in the years after baseline.
# Baseline: one row per person, including the time-invariant variable sex
baseline <- data.frame(
  ID = c(1:10),
  year = 2000,
  marital_status = sample(0:1, 10, replace = TRUE),
  employment = sample(0:1, 10, replace = TRUE),
  sex = sample(c("M", "F"), 10, replace = TRUE)
)
head(baseline)
# Follow-up data for 2001-2010; sex is not recorded here
time_varying <- data.frame(
  ID = c(1:10),
  year = rep(2001:2010, 10),
  marital_status = sample(0:1, 100, replace = TRUE),
  employment = sample(0:1, 100, replace = TRUE)
)
head(time_varying)
I think tidyr::fill() is what you're looking for. It fills in NAs with the last non-NA value.
Example:
library(tidyverse)
baseline <- data.frame(
  ID = c(1:10),
  year = 2000,
  marital_status = sample(0:1, 10, replace = TRUE),
  employment = sample(0:1, 10, replace = TRUE),
  sex = sample(c("M", "F"), 10, replace = TRUE)
)
time_varying <- data.frame(
  ID = c(1:10),
  year = rep(2001:2010, 10),
  marital_status = sample(0:1, 100, replace = TRUE),
  employment = sample(0:1, 100, replace = TRUE)
)
baseline %>%
  bind_rows(time_varying) %>%
  group_by(ID) %>%
  arrange(year) %>%
  fill(sex)
#> # A tibble: 110 × 5
#> # Groups: ID [10]
#> ID year marital_status employment sex
#> <int> <dbl> <int> <int> <chr>
#> 1 1 2000 1 0 F
#> 2 2 2000 0 0 M
#> 3 3 2000 0 1 F
#> 4 4 2000 0 0 M
#> 5 5 2000 0 1 F
#> 6 6 2000 1 0 F
#> 7 7 2000 1 0 F
#> 8 8 2000 1 0 F
#> 9 9 2000 1 0 M
#> 10 10 2000 0 1 F
#> # … with 100 more rows
Created on 2022-07-29 by the reprex package (v2.0.1)
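To see what fill() does in isolation, here is a minimal sketch on a small hand-made data frame. The toy values and the names toy and filled are invented for illustration. Grouping by ID first matters: without it, a person whose first row has a missing sex would inherit the value from the previous person.

```r
library(dplyr)
library(tidyr)

# Hand-made example data (values invented for illustration)
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  sex  = c("F", NA, NA, "M", NA, NA)
)

filled <- toy %>%
  group_by(ID) %>%                        # fill within each person only
  arrange(year, .by_group = TRUE) %>%     # baseline year first within each ID
  fill(sex, .direction = "down") %>%      # carry the last non-NA value forward
  ungroup()

filled$sex
#> [1] "F" "F" "F" "M" "M" "M"
```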
EDIT: If you want one row per ID for each of the unique years in time_varying, then I'd do:
baseline %>%
  # add any other time-invariant cols in this next line
  select(ID, sex) %>%
  left_join(time_varying, by = "ID")
#> ID sex year marital_status employment
#> 1 1 M 2001 0 0
#> 2 1 M 2001 1 0
#> 3 1 M 2001 0 0
#> 4 1 M 2001 1 0
#> 5 1 M 2001 1 1
#> 6 1 M 2001 0 0
#> 7 1 M 2001 1 1
#> 8 1 M 2001 0 1
#> 9 1 M 2001 1 1
#> 10 1 M 2001 0 1
#> 11 2 F 2002 1 0
#> 12 2 F 2002 0 0
#> 13 2 F 2002 1 0
#> 14 2 F 2002 1 1
#> 15 2 F 2002 1 0
#> 16 2 F 2002 1 1
#> 17 2 F 2002 1 1
#> 18 2 F 2002 0 1
#> 19 2 F 2002 1 1
#> 20 2 F 2002 1 1
#> 21 3 M 2003 0 0
#> 22 3 M 2003 1 0
#> 23 3 M 2003 1 0
#> 24 3 M 2003 1 0
#> 25 3 M 2003 1 1
#> 26 3 M 2003 1 0
#> 27 3 M 2003 1 1
#> 28 3 M 2003 1 0
#> 29 3 M 2003 0 1
#> 30 3 M 2003 1 0
#> 31 4 F 2004 1 1
#> 32 4 F 2004 1 0
#> 33 4 F 2004 1 0
#> 34 4 F 2004 0 1
#> 35 4 F 2004 1 0
#> 36 4 F 2004 0 0
#> 37 4 F 2004 1 0
#> 38 4 F 2004 1 1
#> 39 4 F 2004 1 1
#> 40 4 F 2004 1 1
#> 41 5 F 2005 0 0
#> 42 5 F 2005 1 0
#> 43 5 F 2005 1 1
#> 44 5 F 2005 1 1
#> 45 5 F 2005 1 0
#> 46 5 F 2005 1 1
#> 47 5 F 2005 0 1
#> 48 5 F 2005 0 1
#> 49 5 F 2005 1 1
#> 50 5 F 2005 0 0
#> 51 6 F 2006 0 1
#> 52 6 F 2006 1 0
#> 53 6 F 2006 0 0
#> 54 6 F 2006 1 1
#> 55 6 F 2006 0 1
#> 56 6 F 2006 1 1
#> 57 6 F 2006 0 1
#> 58 6 F 2006 1 1
#> 59 6 F 2006 0 1
#> 60 6 F 2006 0 1
#> 61 7 F 2007 1 0
#> 62 7 F 2007 0 1
#> 63 7 F 2007 1 0
#> 64 7 F 2007 1 0
#> 65 7 F 2007 1 1
#> 66 7 F 2007 0 0
#> 67 7 F 2007 0 1
#> 68 7 F 2007 1 0
#> 69 7 F 2007 1 0
#> 70 7 F 2007 1 0
#> 71 8 M 2008 1 1
#> 72 8 M 2008 0 0
#> 73 8 M 2008 0 1
#> 74 8 M 2008 1 0
#> 75 8 M 2008 1 0
#> 76 8 M 2008 1 0
#> 77 8 M 2008 1 1
#> 78 8 M 2008 0 1
#> 79 8 M 2008 1 0
#> 80 8 M 2008 1 0
#> 81 9 F 2009 1 0
#> 82 9 F 2009 0 1
#> 83 9 F 2009 1 1
#> 84 9 F 2009 1 0
#> 85 9 F 2009 1 0
#> 86 9 F 2009 1 1
#> 87 9 F 2009 1 1
#> 88 9 F 2009 0 1
#> 89 9 F 2009 0 0
#> 90 9 F 2009 0 1
#> 91 10 M 2010 0 0
#> 92 10 M 2010 1 0
#> 93 10 M 2010 0 0
#> 94 10 M 2010 0 1
#> 95 10 M 2010 0 0
#> 96 10 M 2010 1 1
#> 97 10 M 2010 0 0
#> 98 10 M 2010 1 1
#> 99 10 M 2010 0 0
#> 100 10 M 2010 0 0
Created on 2022-07-29 by the reprex package (v2.0.1)
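In the output above, every ID is paired with a single year (ID 1 with 2001, ID 2 with 2002, and so on). That appears to come from how the example time_varying is built: ID = c(1:10) is recycled alongside year = rep(2001:2010, 10), so the two columns cycle in step. If the intent is one row per ID per year, a possible construction looks like the sketch below. It assumes that layout is what was meant, and set.seed() is added only to make the sketch reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible

time_varying <- data.frame(
  ID   = rep(1:10, each = 10),           # each ID repeated for 10 rows
  year = rep(2001:2010, times = 10),     # 2001..2010 cycled within every ID
  marital_status = sample(0:1, 100, replace = TRUE),
  employment     = sample(0:1, 100, replace = TRUE)
)

# Every ID should now cover 10 distinct years
time_varying %>%
  group_by(ID) %>%
  summarise(n_years = n_distinct(year))
```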
Given the data you have, you can bind the rows from time_varying with the rows from baseline (without sex), then (inner or outer) join the non-time-varying columns of baseline (ID, sex) to those bound rows:
library(dplyr)
inner_join(
  select(baseline, ID, sex),
  bind_rows(select(baseline, -sex), time_varying),
  by = "ID"  # join key stated explicitly; ID is the only shared column
)
(Note: as @thelatemail has commented, if the baseline year is 2000 and the time_varying data covers the years 2001 to 2010, then the long data will have 11 rows per ID.)
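A quick way to check the row count is to run the bind-then-join approach on a small deterministic example and count rows per ID. The sketch below uses three made-up people with constant follow-up values (all invented for illustration) and assumes time_varying holds one row per ID per year.

```r
library(dplyr)

# Made-up baseline for three people (values invented for illustration)
baseline <- data.frame(
  ID = 1:3,
  year = 2000,
  marital_status = c(0L, 1L, 0L),
  employment = c(1L, 1L, 0L),
  sex = c("F", "M", "F")
)

# Assumed layout: one row per ID per follow-up year
time_varying <- data.frame(
  ID = rep(1:3, each = 10),
  year = rep(2001:2010, times = 3),
  marital_status = 0L,
  employment = 1L
)

long <- inner_join(
  select(baseline, ID, sex),                        # time-invariant columns
  bind_rows(select(baseline, -sex), time_varying),  # all years stacked
  by = "ID"
)

count(long, ID)   # 11 rows per ID: baseline year plus 10 follow-up years
anyNA(long$sex)   # FALSE: sex is present on every row
```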