简体   繁体   中英

Fill missing values from another dataframe with the same columns

I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.

df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25), 
                  rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))

df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))

As you can see, df1 is missing some info that is present in df2, while df2 is only a subset of all the ids, but they both have some similar columns. Is there a way to fill the missing values in df1 based on matching ID's from DF2?

I found a similar question that recommended using merge, but when I tried it, it dropped all the id's that were not present in both dataframes. Plus it required manually dropping duplicate columns and in my real dataset, there will be a large number of these, making doing so cumbersome. Even ignoring that though,

both the recommended solutions:

df1 <- setNames(merge(df1, df2)[-2], names(df1))

and

df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]

did not work for me, throwing various errors.

An alternate solution I have thought of is using rbind then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns so I would have to create intermediate objects of just the shared columns, rbind , then drop incomplete cases, then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.

In this example it would look like

df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")

and, in case there are any fully duplicated rows between the two dataframes, I would need to add

df2 = unique(df2)

This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?

-edit- fixed a problem in my example data pointed out by Sathish

-edit2- Expanded example data

df1 = data.frame(id = c(1:100),  wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
wq5 = rnorm(50))

df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
wq5 = rnorm(50))

These dataframe represents the case where there are many columns that have incomplete data and a second dataframe that has all of the missing data. Ideally, we would not need to separately list each each column with wq1 := i.wq1 etc.

If you want to join only by id column, you can remove phase in the on clause of code below.

Also your data in the question has discrepancies, which are corrected in the data posted in this answer.

library('data.table')
setDT(df1)  # make data table by reference
setDT(df2)  # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase values, and replace color values of df2 with color values of df1

tail(df1)
#     id color phase   rand.col
# 1:  95 green   gas  1.5868335
# 2:  96 green   gas  0.5584864
# 3:  97 green   gas -1.2765922
# 4:  98 green   gas -0.5732654
# 5:  99 green   gas -1.2246126
# 6: 100 green   gas -0.4734006

one-liner:

setDT(df1)[df2, color := i.color, on = .(id, phase)]

Data:

set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25), 
                                           rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))

df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))

EDIT: based on new data posted in the question

Data:

set.seed(1L)
df1 = data.frame(id = c(1:100),  wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
                 wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
                 wq5 = rnorm(50))

Code:

library('data.table')
setDT(df1)[ id == 52, ]
#    id       wq2        wq3        wq4         wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983

setDT(df2)[ id == 52, ]
#    id       wq2       wq3        wq4       wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687

df1[df2, `:=` ( wq2 = i.wq2,
                wq3 = i.wq3,
                wq4 = i.wq4,
                wq5 = i.wq5), on = .(id)]

setDT(df1)[ id == 52, ]
#    id       wq2       wq3        wq4       wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM