
Update column of dataframe1 based on column of dataframe2 + create new row if column1 is not empty

I have a dataframe that I want to update with information from another dataframe, a lookup dataframe.

In particular, I'd like to update the cells of df1$value with the cells of df2$value based on the columns id and id2.

  • If the cell of df1$value is NA, I know how to do it using the package data.table

BUT

  • If the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway.

I don't want that. I'd like to have that:

IF the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which the cell of df1$value takes the value from the cell of df2$value.
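To make the unwanted behaviour concrete, here is a minimal sketch of the plain data.table update join, run on the data from the outline. It assumes the data.table package is installed; the name naive is only for illustration.

```r
library(data.table)

# Data from the outline (including the NA/NA corner case in row e)
df1 <- data.table(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA))

# Work on a copy so that df1 itself is left untouched
naive <- copy(df1)

# Plain update join: every matching row of `naive` takes the value from df2,
# whether or not it already had one
naive[df2, on = .(id, id2), value := i.value]

naive[id == "c", value]  # 200: the original 50 has been overwritten
```

Row c ends up with 200 and the original 50 is lost, which is exactly what should not happen.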

I already looked for solutions online but couldn't find any. Is there a way to do it easily with tidyverse, data.table, or an SQL-like package?

Thank you for your help!

edit: I've just realized that I forgot to include the corner case in which the value is NA in both dataframes. With the replies I've had so far (07/08/19 14:42), row e is dropped from the final dataframe. But I really need to keep it!

Outline:

> df1
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  d   4    NA
5  e   5    NA

> df2
  id id2 value
1  c   3   200
2  d   4   201
3  e   5    NA

# I'd like:

> df5
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  c   3   200
5  d   4   201
6  e   5    NA

This is how I managed to solve my problem but it's quite cumbersome.

# Packages used below
library(dplyr)
library(data.table)

# I create the dataframes
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))

# I first do a left_join so I'll have two value columns: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))

# > df3
#   id id2 value.x value.y
# 1  a   1     100      NA
# 2  b   2     101      NA
# 3  c   3      50     200
# 4  d   4      NA     201

# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
  filter(is.na(value.x)) %>% 
  dplyr::select(id, id2, value.y)

# > df4
#   id id2 value.y
# 1  d   4     201

# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::rename doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"

# > df4
#   id id2 value
# 1  d   4     201

# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]

# > df1
#    id id2 value
# 1:  a   1   100
# 2:  b   2   101
# 3:  c   3    50
# 4:  d   4   201

# I keep only the rows in which neither value.x nor value.y is NA
df3 <- as_tibble(df3) %>%
  filter(!is.na(value.x), !is.na(value.y)) %>% 
  dplyr::select(id, id2, value.y)

# > df3
# # A tibble: 1 x 3
#   id      id2 value.y
#   <chr> <dbl>   <dbl>
# 1 c         3     200

# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"

# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>% 
  arrange(id)

# > df5
#   id id2 value
# 1  a   1   100
# 2  b   2   101
# 3  c   3    50
# 4  c   3   200
# 5  d   4   201

A left join with data.table:

library(data.table)
setDT(df1); setDT(df2)

df2[df1, on=.(id, id2), .(value = 
  if (.N == 0) i.value 
  else na.omit(c(i.value, x.value))
), by=.EACHI]

   id id2 value
1:  a   1   100
2:  b   2   101
3:  c   3    50
4:  c   3   200
5:  d   4   201

How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1, do j.

In this case j = .(value = expr), where .() is a shortcut for list(), since in general j should return a list of columns.

Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep values from i; otherwise we keep values from both tables, dropping missing values.
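To see what .N looks like for each row of i, the same join can be run with .N alone as j. This is only an illustrative sketch on the answer's data; the name matches is made up for the example.

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

# For each row of i = df1, count the rows of x = df2 that match on (id, id2)
matches <- df2[df1, on = .(id, id2), .N, by = .EACHI]
matches
```

Rows a and b get N = 0, so the answer's expression takes the first branch and keeps i.value; rows c and d get N = 1 and go through na.omit(c(i.value, x.value)).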


A dplyr way:

library(dplyr)

bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>% 
  group_by(id, id2) %>% 
  do(if (nrow(.) == 1) . else na.omit(.))

# A tibble: 5 x 3
# Groups:   id, id2 [4]
  id      id2 value
  <chr> <dbl> <dbl>
1 a         1   100
2 b         2   101
3 c         3    50
4 c         3   200
5 d         4   201

Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
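As a possible alternative that avoids do(), a grouped filter() seems to give the same result on this data. This is a sketch rather than part of the original answer: it checks only the value column (na.omit(.) looks at every column), and, like the do() version, it would drop a group whose rows are all NA when the group has more than one row.

```r
library(dplyr)

df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300), stringsAsFactors = FALSE)

res <- bind_rows(df1, semi_join(df2, df1, by = c("id", "id2"))) %>%
  group_by(id, id2) %>%
  # a group's only row is kept as is; in larger groups only non-missing values stay
  filter(n() == 1 | !is.na(value)) %>%
  ungroup()

res
```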


Data:

df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))

> df1
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  d   4    NA
> df2
  id id2 value
1  c   3   200
2  d   4   201
3  e   5   300

Another idea via base R is to remove the rows from df2 that do not match in df1, bind the two data frames rowwise (rbind) and omit the NAs, i.e.

na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))

#  id id2 value
#1  a   1   100
#2  b   2   101
#3  c   3    50
#5  c   3   200
#6  d   4   201

To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.

dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]

#  id id2 value
#1  a   1   100
#2  b   2   101
#3  c   3    50
#5  e   5    NA
#6  c   3   200
#7  d   4   201
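The ave() one-liner is dense, so here is a sketch of the same filter split into named steps. It assumes the updated data from the question stored as plain data frames; all_na, first_in_group and res are names introduced only for this illustration.

```r
df1 <- data.frame(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA), stringsAsFactors = FALSE)

# Same rbind as above: keep only the df2 rows whose (id, id2) pair occurs in df1
dd <- rbind(df1, df2[paste(df2$id, df2$id2) %in% paste(df1$id, df1$id2), ])

# TRUE for every row of a group in which all values are NA
all_na <- as.logical(ave(is.na(dd$value), dd$id, dd$id2, FUN = all))

# TRUE for the first row of each (id, id2) group
first_in_group <- !duplicated(dd[c("id", "id2")])

# Keep every non-missing value, plus one row for groups that are NA throughout
res <- dd[!is.na(dd$value) | (all_na & first_in_group), ]
res
```

Within an all-NA group, !duplicated(i) in the one-liner is TRUE only for the first row, which is what first_in_group stands in for here.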

A possible approach with data.table using update join then full outer merge:

merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)

output:

   id id2 value
1:  a   1   100
2:  b   2   101
3:  c   3    50
4:  c   3   200
5:  d   4   201
6:  e   5    NA

data:

library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))
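Note that the one-liner modifies df1 by reference. The two steps can also be unpacked on a copy, as in this sketch; the name filled and the explicit by argument are additions for illustration (the one-liner relies on merge() defaulting to the shared columns).

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA))

# Step 1: update join restricted to the rows of df1 whose value is NA
filled <- copy(df1)
filled[is.na(value), value := df2[.SD, on = .(id, id2), x.value]]

# Step 2: full outer merge on all three columns, so that df2 rows whose value
# differs from the one kept in `filled` (here c / 200) come in as extra rows
res <- merge(filled, df2, by = c("id", "id2", "value"), all = TRUE)
res
```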

Here is one way using left_join and gather

library(dplyr)

left_join(df1, df2, by = c("id","id2")) %>%
   tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
   select(-key)

#   id id2 value
#1   a   1   100
#2   b   2   101
#3   c   3    50
#7   c   3   200
#8   d   4   201
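In current tidyr, gather() is superseded by pivot_longer(). A rough equivalent of the snippet above, assuming tidyr 1.0.0 or later and the original four-row data, might look like this:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300), stringsAsFactors = FALSE)

res <- left_join(df1, df2, by = c("id", "id2")) %>%
  # stack value.x and value.y into a single column, dropping missing values
  pivot_longer(starts_with("value"), names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  select(-key)

res
```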

For the updated case, we can do

left_join(df1, df2, by = c("id","id2")) %>%
   tidyr::gather(key, value, starts_with("value")) %>%
   group_by(id, id2) %>%
   filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
   select(-key)

#  id      id2 value
#  <chr> <int> <int>
#1 a         1   100
#2 b         2   101
#3 c         3    50
#4 e         5    NA
#5 c         3   200
#6 d         4   201
