Update column of dataframe1 based on column of dataframe2 + create new row if column1 is not empty

I have a dataframe that I want to update with information from another dataframe, a lookup dataframe.

In particular, I'd like to update the cells of df1$value with the cells of df2$value, matching rows on the columns id and id2.

  • If the cell of df1$value is NA, I know how to do it using the package data.table.

BUT

  • If the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway.

I don't want that. Instead, I'd like the following:

If the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which df1$value takes the value from the cell of df2$value.

I already looked for solutions online but I couldn't find any. Is there a way to do it easily with tidyverse, data.table or an SQL-like package?

Thank you for your help!

Edit: I've just realized that I forgot to include the corner case in which the value is NA in both dataframes. With the replies I have had so far (07/08/19 14:42), row e is removed from the final dataframe. But I really need to keep it!

Outline:

> df1
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  d   4    NA
5  e   5    NA

> df2
  id id2 value
1  c   3   200
2  d   4   201
3  e   5    NA

# I'd like:

> df5
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  c   3   200
5  d   4   201
6  e   5    NA

This is how I managed to solve my problem, but it's quite cumbersome.

# Packages used below (dplyr for the joins and pipes, data.table for the update join)
library(dplyr)
library(data.table)

# I create the dataframes
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))

# I first do a left_join so I'll have two value columns: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))

# > df3
#   id id2 value.x value.y
# 1  a   1     100      NA
# 2  b   2     101      NA
# 3  c   3      50     200
# 4  d   4      NA     201

# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
  filter(is.na(value.x)) %>% 
  dplyr::select(id, id2, value.y)

# > df4
#   id id2 value.y
# 1  d   4     201

# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::replace doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"

# > df4
#   id id2 value
# 1  d   4     201

# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]

# > df1
#    id id2 value
# 1:  a   1   100
# 2:  b   2   101
# 3:  c   3    50
# 4:  d   4   201

# I keep only the rows in which neither value.x nor value.y is NA
df3 <- as_tibble(df3) %>%
  filter(!is.na(value.x), !is.na(value.y)) %>% 
  dplyr::select(id, id2, value.y)

# > df3
# # A tibble: 1 x 3
#   id      id2 value.y
#   <chr> <dbl>   <dbl>
# 1 c         3     200

# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"

# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>% 
  arrange(id)

# > df5
#   id id2 value
# 1  a   1   100
# 2  b   2   101
# 3  c   3    50
# 4  c   3   200
# 5  d   4   201

A left join with data.table:

library(data.table)
setDT(df1); setDT(df2)

df2[df1, on=.(id, id2), .(value = 
  if (.N == 0) i.value 
  else na.omit(c(i.value, x.value))
), by=.EACHI]

   id id2 value
1:  a   1   100
2:  b   2   101
3:  c   3    50
4:  c   3   200
5:  d   4   201

How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1, do j.

In this case j = .(value = expr), where .() is a shortcut for list(), since in general j should return a list of columns.

Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep the value from i; otherwise we keep values from both tables, dropping missing values.
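
To make the role of .N concrete, here is a minimal sketch that counts the matches per row of df1 instead of building the value column. It assumes the four-row toy data of this answer; the names matches and n_matches are made up for illustration.

```r
library(data.table)

# Toy data as used in this answer (the original four-row df1).
df1 <- data.table(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

# j is evaluated once per row of i = df1 because of by = .EACHI;
# .N is the number of rows of x = df2 matching that row on (id, id2).
matches <- df2[df1, on = .(id, id2), .(n_matches = .N), by = .EACHI]
print(matches)
```

Rows a and b should report zero matches, which is the branch where the answer keeps i.value; rows c and d each have one match, so both values are considered and NAs are dropped.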


A dplyr way:

bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>% 
  group_by(id, id2) %>% 
  do(if (nrow(.) == 1) . else na.omit(.))

# A tibble: 5 x 3
# Groups:   id, id2 [4]
  id      id2 value
  <chr> <dbl> <dbl>
1 a         1   100
2 b         2   101
3 c         3    50
4 c         3   200
5 d         4   201

Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
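
For reference, a semi join can still be emulated in data.table with a couple of extra steps. The following is only a sketch under the assumption of a reasonably recent data.table (nomatch = NULL; older versions use nomatch = 0L); idx and df2_semi are hypothetical names.

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

# which = TRUE returns row numbers of x = df2 instead of the joined data;
# nomatch = NULL drops the rows of df1 that have no partner in df2.
idx <- df2[df1, on = .(id, id2), which = TRUE, nomatch = NULL]

# Semi join: each matching row of df2 once, in df2's original order.
df2_semi <- df2[sort(unique(idx))]
print(df2_semi)
```

This keeps rows c and d of df2 and drops row e, which is what semi_join(df2, df1, by = c("id", "id2")) does in the dplyr version.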


Data:

df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))

> df1
  id id2 value
1  a   1   100
2  b   2   101
3  c   3    50
4  d   4    NA
> df2
  id id2 value
1  c   3   200
2  d   4   201
3  e   5   300

Another idea via base R is to remove the rows from df2 that have no match in df1, bind the two data frames row-wise (rbind) and omit the NAs, i.e.

na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))

#  id id2 value
#1  a   1   100
#2  b   2   101
#3  c   3    50
#5  c   3   200
#6  d   4   201

To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.

dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]

#  id id2 value
#1  a   1   100
#2  b   2   101
#3  c   3    50
#5  e   5    NA
#6  c   3   200
#7  d   4   201
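
The filter above is fairly dense, so here is the same idea spelled out step by step in base R. It is a sketch that assumes the five-row df1 and the df2 with an NA for row e from the question's edit; key1, key2, all_na, first_in_group, keep and res are illustrative names only.

```r
# Data with the all-NA corner case (row e), as in the question's edit.
df1 <- data.frame(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA), stringsAsFactors = FALSE)

# Keep only the df2 rows whose (id, id2) pair also occurs in df1, then stack.
key1 <- paste(df1$id, df1$id2)
key2 <- paste(df2$id, df2$id2)
dd <- rbind(df1, df2[key2 %in% key1, ])

# Per (id, id2) group: TRUE when every value in the group is NA.
key <- paste(dd$id, dd$id2)
all_na <- ave(is.na(dd$value), key, FUN = all)

# TRUE for the first row of each group.
first_in_group <- !duplicated(key)

# Keep every non-NA value; for an all-NA group keep only its first row.
keep <- !is.na(dd$value) | (all_na & first_in_group)
res <- dd[keep, ]
print(res)
```

Pasting the key columns together is a quick way to compare multi-column keys, but it can produce false matches if the pasted strings collide, so a separator that cannot occur in the data is safer on real inputs.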

A possible approach with data.table using an update join followed by a full outer merge:

merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)

output:

   id id2 value
1:  a   1   100
2:  b   2   101
3:  c   3    50
4:  c   3   200
5:  d   4   201
6:  e   5    NA
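
The one-liner can be unpacked into its two steps, which may make it easier to follow. This is a sketch on the same data as this answer; the by argument is written out explicitly here (it names all shared columns, which is what merge falls back to when the tables have no keys), and res is an illustrative name.

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA))

# Step 1 (update join): only where df1$value is NA, look the value up in df2.
# .SD is the subset of df1 rows being updated; x.value is df2's value column.
df1[is.na(value), value := df2[.SD, on = .(id, id2), x.value]]

# Step 2 (full outer merge on all three columns): rows of df2 that are
# already present in df1 collapse, rows with a different value are added.
res <- merge(df1, df2, by = c("id", "id2", "value"), all = TRUE)
print(res)
```

Step 1 assumes each (id, id2) pair occurs at most once in df2; with duplicate keys the lookup would return more values than there are rows to update.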

data:

library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))

Here is one way using left_join and gather

library(dplyr)

left_join(df1, df2, by = c("id","id2")) %>%
   tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
   select(-key)

#   id id2 value
#1   a   1   100
#2   b   2   101
#3   c   3    50
#7   c   3   200
#8   d   4   201
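
In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same first (non-updated) case with pivot_longer, assuming tidyr 1.0.0 or later and the original four-row data, could look like this; res is an illustrative name.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300), stringsAsFactors = FALSE)

res <- left_join(df1, df2, by = c("id", "id2")) %>%
  # value.x and value.y are stacked into one column; NA values are dropped
  pivot_longer(starts_with("value"), names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  select(-key)
print(res)
```

Unlike gather(), pivot_longer() emits the reshaped rows per input row, so the two rows for c end up next to each other without an extra sort.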

For the updated case, we can do

left_join(df1, df2, by = c("id","id2")) %>%
   tidyr::gather(key, value, starts_with("value")) %>%
   group_by(id, id2) %>%
   filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
   select(-key)

#  id      id2 value
#  <chr> <int> <int>
#1 a         1   100
#2 b         2   101
#3 c         3    50
#4 e         5    NA
#5 c         3   200
#6 d         4   201

