When joining two dataframes, how can I replace missing values in one dataset with values from the other dataset?
My working example comes from a 3-wave (time point) study, where some questions were omitted from consecutive waves. I want to produce a full dataset with all waves in long format that I can easily split into smaller sets while keeping all the variables meaningful.
Here is some reproducible code:
df1<-data.frame(id=seq(10),
sex=rep(c(1,2), 5),
age=sample(c(18:24), 10, replace = TRUE),
x = rnorm(10),
wave = rep("wave1", 10))
df2<-data.frame(id=seq(10),
x = rnorm(10),
wave = rep("wave2", 10))
dplyr::full_join(df1, df2)
Joining, by = c("id", "x", "wave")
id sex age x wave
1 1 1 18 0.7236847 wave1
2 2 2 18 0.5730599 wave1
3 3 1 21 2.0341799 wave1
4 4 2 20 -0.1531575 wave1
5 5 1 18 -0.6089901 wave1
6 6 2 18 -0.3233804 wave1
7 7 1 19 -0.1417807 wave1
8 8 2 21 0.9557512 wave1
9 9 1 24 0.6522168 wave1
10 10 2 20 0.1595824 wave1
11 1 NA NA 1.9694018 wave2
12 2 NA NA 1.4153806 wave2
13 3 NA NA 1.1160011 wave2
14 4 NA NA -0.6040353 wave2
15 5 NA NA -0.3750569 wave2
16 6 NA NA 0.4826182 wave2
17 7 NA NA 0.7210480 wave2
18 8 NA NA 1.9068413 wave2
19 9 NA NA 1.5355046 wave2
20 10 NA NA 1.3607414 wave2
My goal is: based on participant id, replace NA in sex and age for wave2 measurements with wave1 data.
EDIT: Please assume that I no longer have access to df1 and df2. I'm working with the joined data only, and in reality there are more variables that come with NAs. I should have specified this earlier.
Update: without access to df1 and df2, you can use zoo's na.locf function.
df <- dplyr::full_join(df1, df2)
library( zoo )
library( data.table )
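# Fill forward, then backward, within each id so that a leading NA is filled as well
dt <- setDT(df)[, `:=`( sex = zoo::na.locf( zoo::na.locf( sex, na.rm = FALSE ), fromLast = TRUE ),
age = zoo::na.locf( zoo::na.locf( age, na.rm = FALSE ), fromLast = TRUE ) ), by = id ]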
dt
# id sex age x wave
# 1: 1 1 22 -1.03971504 wave1
# 2: 2 2 22 -0.40848104 wave1
# 3: 3 1 18 -0.32354030 wave1
# 4: 4 2 23 0.01220463 wave1
# 5: 5 1 24 0.83800380 wave1
# 6: 6 2 19 0.31674395 wave1
# 7: 7 1 22 -0.62997068 wave1
# 8: 8 2 19 -0.02830660 wave1
# 9: 9 1 23 -0.48257814 wave1
# 10: 10 2 24 -0.82725441 wave1
# 11: 1 1 22 -2.04179796 wave2
# 12: 2 2 22 1.66578389 wave2
# 13: 3 1 18 0.63893257 wave2
# 14: 4 2 23 0.37758646 wave2
# 15: 5 1 24 -1.64174887 wave2
# 16: 6 2 19 -2.93152667 wave2
# 17: 7 1 22 0.14474519 wave2
# 18: 8 2 19 -1.18826640 wave2
# 19: 9 1 23 0.68365951 wave2
# 20: 10 2 24 -0.21636650 wave2
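Since the edit to the question mentions more variables with NAs, the same idea can be applied to every affected column at once. The following is only a sketch on made-up toy data: the income column and the fill_both helper are hypothetical names introduced for illustration, and it assumes every column containing an NA is constant within a participant.

```r
library(data.table)
library(zoo)

# Toy joined data (made-up values) mimicking the full_join result
dt <- data.table(
  id     = rep(1:3, times = 2),
  sex    = c(1, 2, 1, NA, NA, NA),
  age    = c(18L, 21L, 24L, NA, NA, NA),
  income = c(NA, NA, NA, 10, 20, 30),   # hypothetical variable asked only in wave 2
  x      = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  wave   = rep(c("wave1", "wave2"), each = 3)
)

# Names of all columns that contain at least one NA
na_cols <- names(dt)[colSums(is.na(dt)) > 0]

# Hypothetical helper: carry values forward, then backward, keeping the length
fill_both <- function(v) {
  na.locf(na.locf(v, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}

# Apply the helper to every NA column, separately for each id
dt[, (na_cols) := lapply(.SD, fill_both), by = id, .SDcols = na_cols]
```

Note that this fills in both directions, so a variable that genuinely changes between waves and was skipped in one of them would be filled with a value from another wave. Restrict na_cols by hand if that is not what you want.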
You actually need to rbind, not merge, so you can create the two extra columns and rbind, i.e.
rbind(df1, data.frame(df2, sex = df1$sex, age = df1$age))
which gives,
id sex age x wave
1 1 1 24 0.23277867 wave1
2 2 2 19 0.28211730 wave1
3 3 1 23 0.69541360 wave1
4 4 2 21 0.11846487 wave1
5 5 1 23 -0.08540101 wave1
6 6 2 19 1.55917732 wave1
7 7 1 20 -0.27636738 wave1
8 8 2 20 -1.55094487 wave1
9 9 1 21 1.60901222 wave1
10 10 2 21 -0.05709374 wave1
11 1 1 24 -0.86825838 wave2
12 2 2 19 -0.32215557 wave2
13 3 1 23 -1.29894673 wave2
14 4 2 21 -0.24631532 wave2
15 5 1 23 2.65130947 wave2
16 6 2 19 0.03424642 wave2
17 7 1 20 0.55383179 wave2
18 8 2 20 0.09771911 wave2
19 9 1 21 -0.14435681 wave2
20 10 2 21 -1.66916275 wave2
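The rbind line copies sex and age by position, so it relies on df1 and df2 listing participants in the same order. If that cannot be guaranteed, aligning by id with match is safer. This is a sketch on made-up toy data, and it assumes the two original data frames are still available (which the edit to the question rules out); df2_full and long are illustrative names.

```r
# Toy stand-ins for df1 and df2 (made-up values); df2 is deliberately shuffled
df1 <- data.frame(id = 1:4, sex = c(1, 2, 1, 2), age = c(18L, 21L, 19L, 24L),
                  x = c(0.5, -0.2, 1.1, 0.3), wave = "wave1")
df2 <- data.frame(id = c(3L, 1L, 4L, 2L), x = c(0.9, -1.4, 0.2, 0.7),
                  wave = "wave2")

# Find each wave-2 participant's row in df1 by id, not by position
pos <- match(df2$id, df1$id)
df2_full <- data.frame(df2, sex = df1$sex[pos], age = df1$age[pos])

# rbind on data frames matches columns by name, so their order need not agree
long <- rbind(df1, df2_full)
```

A participant who appears only in df2 gets NA for sex and age rather than somebody else's values.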
If you want to change the values after the join, you can match on id and then update the values:
df3 <- dplyr::full_join(df1, df2)
# Look up each wave2 id among the wave1 rows, then copy sex and age across
w1 <- df3[df3$wave == "wave1", ]
is_w2 <- df3$wave == "wave2"
inds <- match(df3$id[is_w2], w1$id)
df3[is_w2, c("sex", "age")] <- w1[inds, c("sex", "age")]
# id sex age x wave
#1 1 1 24 -0.76956510 wave1
#......
#......
#16 6 2 24 -0.25209124 wave2
#17 7 1 24 1.93524314 wave2
#18 8 2 21 0.02210736 wave2
#19 9 1 19 -1.03520607 wave2
#20 10 2 24 0.54103663 wave2
You could also do it in three lines with dplyr and the zoo package.
library(dplyr)
library(zoo)
df3 <- dplyr::full_join(df1, df2)
df3 %>%
arrange(id) %>%
do(na.locf(.))
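One caveat: na.locf applied to the whole data frame carries values down every column regardless of id, so it works here only because each participant's wave1 row directly precedes the wave2 row after sorting. A group-aware alternative, not taken from the answers here, is tidyr::fill. This is a sketch on made-up toy data:

```r
library(dplyr)
library(tidyr)

# Toy joined data (made-up values) mimicking the full_join result
df3 <- tibble(
  id   = rep(1:3, times = 2),
  sex  = c(1, 2, 1, NA, NA, NA),
  age  = c(18L, 21L, 24L, NA, NA, NA),
  x    = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  wave = rep(c("wave1", "wave2"), each = 3)
)

# Fill sex and age within each id; "downup" fills downward first, then upward
filled <- df3 %>%
  group_by(id) %>%
  fill(sex, age, .direction = "downup") %>%
  ungroup()
```

Because the fill happens inside each id group, the row order does not need to be changed first and values never leak from one participant to the next.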
You can use mutate_at and keep the first value for each id:
df3 %>%
group_by(id) %>%
mutate_at(vars(sex,age),first) %>%
ungroup()
# # A tibble: 20 x 5
# id sex age x wave
# <int> <dbl> <int> <dbl> <chr>
# 1 1 1 20 -1.9380810 wave1
# 2 2 2 18 -1.6587271 wave1
# 3 3 1 19 -0.3262624 wave1
# 4 4 2 22 1.7939726 wave1
# 5 5 1 24 -0.7964016 wave1
# 6 6 2 22 0.3781070 wave1
# 7 7 1 18 -0.5051593 wave1
# 8 8 2 20 -0.4301633 wave1
# 9 9 1 18 2.0959696 wave1
# 10 10 2 23 0.8634686 wave1
# 11 1 1 20 2.3539693 wave2
# 12 2 2 18 0.5544678 wave2
# 13 3 1 19 -0.1502509 wave2
# 14 4 2 22 1.0797118 wave2
# 15 5 1 24 0.3716175 wave2
# 16 6 2 22 1.1135225 wave2
# 17 7 1 18 0.5832351 wave2
# 18 8 2 20 0.8694125 wave2
# 19 9 1 18 -0.3765263 wave2
# 20 10 2 23 -0.4019392 wave2
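Keep in mind that first takes whatever sits in the first row of each id, so it returns NA if that row happens to be the incomplete one. A variant that picks the first non-missing value instead could look like the sketch below; it uses across from dplyr 1.0.0 or later on made-up toy data, and like the code above it assumes sex and age are constant within a participant.

```r
library(dplyr)

# Toy data (made-up values) where the first row of an id is not always complete
df3 <- tibble(
  id   = c(1L, 2L, 1L, 2L),
  sex  = c(NA, 2, 1, NA),
  age  = c(20L, NA, NA, 18L),
  wave = c("wave2", "wave1", "wave1", "wave2")
)

# Within each id, replace the column by its first non-missing value
filled <- df3 %>%
  group_by(id) %>%
  mutate(across(c(sex, age), ~ .x[!is.na(.x)][1])) %>%
  ungroup()
```

If a participant has no value at all for a column, the result for that participant stays NA.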