After full_join() how to replace NAs in one source with data from other source

When joining two dataframes, how can I replace missing values in one dataset with values from the other dataset?

My working example comes from a three-wave (three time points) study in which some questions were omitted from later waves. I want to produce a full dataset with all waves in long format that I can easily split into smaller sets while keeping all the variables meaningful.

Here is some reproducible code:

df1<-data.frame(id=seq(10),
                sex=rep(c(1,2), 5),
                age=sample(c(18:24), 10, replace = T),
                x = rnorm(10),
                wave = rep("wave1", 10))

df2<-data.frame(id=seq(10),
                x = rnorm(10),
                wave = rep("wave2", 10))

dplyr::full_join(df1, df2)

Joining, by = c("id", "x", "wave")
   id sex age          x  wave
1   1   1  18  0.7236847 wave1
2   2   2  18  0.5730599 wave1
3   3   1  21  2.0341799 wave1
4   4   2  20 -0.1531575 wave1
5   5   1  18 -0.6089901 wave1
6   6   2  18 -0.3233804 wave1
7   7   1  19 -0.1417807 wave1
8   8   2  21  0.9557512 wave1
9   9   1  24  0.6522168 wave1
10 10   2  20  0.1595824 wave1
11  1  NA  NA  1.9694018 wave2
12  2  NA  NA  1.4153806 wave2
13  3  NA  NA  1.1160011 wave2
14  4  NA  NA -0.6040353 wave2
15  5  NA  NA -0.3750569 wave2
16  6  NA  NA  0.4826182 wave2
17  7  NA  NA  0.7210480 wave2
18  8  NA  NA  1.9068413 wave2
19  9  NA  NA  1.5355046 wave2
20 10  NA  NA  1.3607414 wave2

My goal is: based on participant id, replace the NAs in sex and age for the wave2 measurements with the wave1 data.

EDIT: Please assume that I no longer have access to df1 and df2. I am working with the joined data only, and in reality there are more variables that contain NAs. I should have specified this earlier.

Update: without access to df1 and df2, you can use the na.locf function from zoo:

df <- dplyr::full_join(df1, df2)

library( zoo )
library( data.table )

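# Within each id, fill forward, then backward to catch leading NAs;
# na.rm = FALSE keeps each group's length unchanged.
dt <- setDT(df)[, `:=`( sex = zoo::na.locf( zoo::na.locf( sex, na.rm = FALSE ), fromLast = TRUE, na.rm = FALSE ),
                        age = zoo::na.locf( zoo::na.locf( age, na.rm = FALSE ), fromLast = TRUE, na.rm = FALSE ) ), by = id ]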
dt

#    id sex age           x  wave
# 1:  1   1  22 -1.03971504 wave1
# 2:  2   2  22 -0.40848104 wave1
# 3:  3   1  18 -0.32354030 wave1
# 4:  4   2  23  0.01220463 wave1
# 5:  5   1  24  0.83800380 wave1
# 6:  6   2  19  0.31674395 wave1
# 7:  7   1  22 -0.62997068 wave1
# 8:  8   2  19 -0.02830660 wave1
# 9:  9   1  23 -0.48257814 wave1
# 10: 10   2  24 -0.82725441 wave1
# 11:  1   1  22 -2.04179796 wave2
# 12:  2   2  22  1.66578389 wave2
# 13:  3   1  18  0.63893257 wave2
# 14:  4   2  23  0.37758646 wave2
# 15:  5   1  24 -1.64174887 wave2
# 16:  6   2  19 -2.93152667 wave2
# 17:  7   1  22  0.14474519 wave2
# 18:  8   2  19 -1.18826640 wave2
# 19:  9   1  23  0.68365951 wave2
# 20: 10   2  24 -0.21636650 wave2
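
As an alternative to zoo, the same per-id fill can be sketched with tidyr's fill(). This is not part of the answer above; the toy data frame `joined` is a hypothetical stand-in for the already-joined data, and the sketch assumes sex and age are constant within each id.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the joined data (df1 and df2 no longer available)
set.seed(1)
joined <- data.frame(
  id   = rep(1:3, times = 2),
  sex  = c(1, 2, 1, NA, NA, NA),
  age  = c(18L, 21L, 24L, NA, NA, NA),
  x    = rnorm(6),
  wave = rep(c("wave1", "wave2"), each = 3)
)

filled <- joined %>%
  group_by(id) %>%
  # "downup" fills forward first, then backward, within each id
  fill(sex, age, .direction = "downup") %>%
  ungroup()

filled
```

Because the fill runs in both directions inside each group, it should not matter which wave holds the non-missing value.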

You actually need rbind rather than a merge: create the two extra columns in df2 and then rbind, i.e.

rbind(df1, data.frame(df2, sex = df1$sex, age = df1$age))

which gives,

   id sex age           x  wave
1   1   1  24  0.23277867 wave1
2   2   2  19  0.28211730 wave1
3   3   1  23  0.69541360 wave1
4   4   2  21  0.11846487 wave1
5   5   1  23 -0.08540101 wave1
6   6   2  19  1.55917732 wave1
7   7   1  20 -0.27636738 wave1
8   8   2  20 -1.55094487 wave1
9   9   1  21  1.60901222 wave1
10 10   2  21 -0.05709374 wave1
11  1   1  24 -0.86825838 wave2
12  2   2  19 -0.32215557 wave2
13  3   1  23 -1.29894673 wave2
14  4   2  21 -0.24631532 wave2
15  5   1  23  2.65130947 wave2
16  6   2  19  0.03424642 wave2
17  7   1  20  0.55383179 wave2
18  8   2  20  0.09771911 wave2
19  9   1  21 -0.14435681 wave2
20 10   2  21 -1.66916275 wave2
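
The rbind line takes sex and age from df1 by position, so it presumably relies on both data frames listing the ids in the same order. A minimal sketch that aligns the rows by id with match() instead; the small `df1` and `df2` below are hypothetical toy data, with df2 deliberately shuffled:

```r
# Hypothetical toy data; df2 lists the ids in a different order than df1
df1 <- data.frame(id = 1:3, sex = c(1, 2, 1), age = c(18, 21, 24),
                  x = c(0.7, 0.5, 2.0), wave = "wave1")
df2 <- data.frame(id = c(3, 1, 2), x = c(1.1, 1.9, 1.4), wave = "wave2")

# For each row of df2, the position of the same id in df1
idx <- match(df2$id, df1$id)

# rbind() on data frames matches columns by name, so column order may differ
stacked <- rbind(df1, data.frame(df2, sex = df1$sex[idx], age = df1$age[idx]))
stacked
```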

If you want to fill in the values after the join, you can match on id and then update:

df3 <- dplyr::full_join(df1, df2)

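is_w1 <- df3$wave == "wave1"
is_w2 <- df3$wave == "wave2"
# For each wave2 row, find the position of the same id among the wave1 rows
inds <- match(df3$id[is_w2], df3$id[is_w1])
df3[is_w2, c("sex", "age")] <- df3[is_w1, c("sex", "age")][inds, ]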

#   id sex age           x  wave
#1   1   1  24 -0.76956510 wave1
#......
#......
#16  6   2  24 -0.25209124 wave2
#17  7   1  24  1.93524314 wave2
#18  8   2  21  0.02210736 wave2
#19  9   1  19 -1.03520607 wave2
#20 10   2  24  0.54103663 wave2
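
Since the question mentions that more variables contain NAs, here is a base R sketch that finds every such column and fills only its missing entries from the first non-missing value for the same id. The data frame `joined` is hypothetical, and the sketch assumes those columns are meant to be constant within a participant.

```r
# Hypothetical stand-in for the joined data
joined <- data.frame(
  id   = c(1, 2, 3, 1, 2, 3),
  sex  = c(1, 2, 1, NA, NA, NA),
  age  = c(18, 21, 24, NA, NA, NA),
  x    = c(0.7, 0.5, 2.0, 1.9, 1.4, 1.1),
  wave = rep(c("wave1", "wave2"), each = 3)
)

# Columns that contain at least one NA (here: sex and age)
na_cols <- names(joined)[colSums(is.na(joined)) > 0]

for (col in na_cols) {
  # Within each id, overwrite only the NAs with the first non-missing value
  joined[[col]] <- ave(joined[[col]], joined$id, FUN = function(v) {
    v[is.na(v)] <- v[!is.na(v)][1]
    v
  })
}

joined
```

If an id has no non-missing value at all for a column, its entries simply stay NA.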

You could also do it in three lines with dplyr and the zoo package:

library(dplyr)
library(zoo)
df3 <- dplyr::full_join(df1, df2)
df3 %>%
  arrange(id) %>%
  do(na.locf(.))

You can use mutate_at and keep the first value for each id (this relies on the first row of each id holding the non-missing values):

df3 %>%
  group_by(id) %>%
  mutate_at(vars(sex,age),first) %>%
  ungroup()
# # A tibble: 20 x 5
#       id   sex   age          x  wave
#    <int> <dbl> <int>      <dbl> <chr>
#  1     1     1    20 -1.9380810 wave1
#  2     2     2    18 -1.6587271 wave1
#  3     3     1    19 -0.3262624 wave1
#  4     4     2    22  1.7939726 wave1
#  5     5     1    24 -0.7964016 wave1
#  6     6     2    22  0.3781070 wave1
#  7     7     1    18 -0.5051593 wave1
#  8     8     2    20 -0.4301633 wave1
#  9     9     1    18  2.0959696 wave1
# 10    10     2    23  0.8634686 wave1
# 11     1     1    20  2.3539693 wave2
# 12     2     2    18  0.5544678 wave2
# 13     3     1    19 -0.1502509 wave2
# 14     4     2    22  1.0797118 wave2
# 15     5     1    24  0.3716175 wave2
# 16     6     2    22  1.1135225 wave2
# 17     7     1    18  0.5832351 wave2
# 18     8     2    20  0.8694125 wave2
# 19     9     1    18 -0.3765263 wave2
# 20    10     2    23 -0.4019392 wave2 
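
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). A minimal sketch of the same idea that picks the first non-missing value per id, so it should also work when the complete row is not the first one; the toy data frame `joined` is hypothetical:

```r
library(dplyr)

# Hypothetical joined data; for id 1 the wave1 row is the incomplete one
joined <- data.frame(
  id   = c(1, 2, 3, 1, 2, 3),
  sex  = c(NA, 2, 1, 1, NA, NA),
  age  = c(NA, 21, 24, 18, NA, NA),
  x    = c(0.7, 0.5, 2.0, 1.9, 1.4, 1.1),
  wave = rep(c("wave1", "wave2"), each = 3)
)

filled <- joined %>%
  group_by(id) %>%
  # Take the first non-missing value in each group; it is recycled to all rows
  mutate(across(c(sex, age), ~ .x[!is.na(.x)][1])) %>%
  ungroup()

filled
```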
