Update column of dataframe1 based on column of dataframe2 + create new row if column1 is not empty
I have a dataframe that I want to update with information from another dataframe, a lookup dataframe. In particular, I'd like to update the cells of df1$value with the cells of df2$value based on the columns id and id2.
If the cell of df1$value is NA, I know how to do it using the package data.table, BUT if the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway. I don't want that. What I'd like instead is:
IF the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which the cell of df1$value takes the value from the cell of df2$value.
I already looked for solutions online but I couldn't find any. Is there a way to do it easily with tidyverse or data.table or an SQL-like package?
Thank you for your help!
Edit: I've just realized that I forgot to include the corner case in which the value is NA in both dataframes. With the replies I had so far (07/08/19 14:42), the row e is removed from the final dataframe. But I really need to keep it!
Outline:
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
5 e 5 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 NA
# I'd like:
> df5
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
6 e 5 NA
This is how I managed to solve my problem, but it's quite cumbersome.
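# I load the packages used below (dplyr for the pipe and verbs, data.table for setDT)
library(dplyr)
library(data.table)
# I create the dataframes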
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
# I first do a left_join so I'll have two value columns: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))
# > df3
# id id2 value.x value.y
# 1 a 1 100 NA
# 2 b 2 101 NA
# 3 c 3 50 200
# 4 d 4 NA 201
# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
  filter(is.na(value.x)) %>%
  dplyr::select(id, id2, value.y)
# > df4
# id id2 value.y
# 1 d 4 201
# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::replace doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"
# > df4
# id id2 value
# 1 d 4 201
# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]
# > df1
# id id2 value
# 1: a 1 100
# 2: b 2 101
# 3: c 3 50
# 4: d 4 201
# I keep only the rows in which neither value.x nor value.y is NA
df3 <- as_tibble(df3) %>%
  filter(!is.na(value.x), !is.na(value.y)) %>%
  dplyr::select(id, id2, value.y)
# > df3
# # A tibble: 1 x 3
# id id2 value.y
# <chr> <dbl> <dbl>
# 1 c 3 200
# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"
# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>%
  arrange(id)
# > df5
# id id2 value
# 1 a 1 100
# 2 b 2 101
# 3 c 3 50
# 4 c 3 200
# 5 d 4 201
A left join with data.table:
library(data.table)
setDT(df1); setDT(df2)
df2[df1, on=.(id, id2), .(value =
  if (.N == 0) i.value
  else na.omit(c(i.value, x.value))
), by=.EACHI]
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1, do j.
In this case j = .(value = expr), where .() is a shortcut for list(), since in general j should return a list of columns.
Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep the value from i; otherwise we keep the values from both tables, dropping missing values.
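To make the role of .N more concrete, here is a minimal sketch that only counts the matches instead of building the value column. It reuses the answer's data; the name matches is just an illustrative variable, not part of the original answer.

```r
library(data.table)

# Same data as in the answer (without the corner-case row e in df1)
df1 <- data.table(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

# For each row of i = df1, count the rows of x = df2 sharing its (id, id2) key.
# With by = .EACHI, j is evaluated once per row of df1.
matches <- df2[df1, on = .(id, id2), .N, by = .EACHI]
print(matches)
```

Rows a and b should show a count of zero, which is the case the `if (.N == 0)` branch handles, while c and d each find one row in df2.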
A dplyr way:
bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>%
  group_by(id, id2) %>%
  do(if (nrow(.) == 1) . else na.omit(.))
# A tibble: 5 x 3
# Groups: id, id2 [4]
id id2 value
<chr> <dbl> <dbl>
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
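One possible way around do() is a grouped filter(). The following is only a sketch, not part of the original answer: it assumes the question's updated data (row e is NA in both data frames) and keeps a single row for a group that contains nothing but NAs. The name res is illustrative.

```r
library(dplyr)

# Data including the corner case from the question's edit
df1 <- data.frame(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA), stringsAsFactors = FALSE)

res <- bind_rows(df1, semi_join(df2, df1, by = c("id", "id2"))) %>%
  group_by(id, id2) %>%
  # Keep every non-missing value; if a group holds only NAs, keep its first row
  filter(!is.na(value) | (all(is.na(value)) & row_number() == 1)) %>%
  ungroup() %>%
  arrange(id, id2)

print(res)
```

Because filter() can return any number of rows per group, it covers the "dynamically determined number of rows" requirement without do().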
Data:
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 300
Another idea via base R is to remove the rows from df2 that do not match in df1, bind the two data frames row-wise (rbind), and omit the NAs, i.e.
na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 c 3 200
#6 d 4 201
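The key-matching step in that one-liner can be looked at on its own. This is a small sketch assuming plain data.frames (with a data.table, df2[1:2] would select rows rather than columns); key1, key2 and keep are illustrative names.

```r
df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300), stringsAsFactors = FALSE)

# Paste the first two columns (id, id2) of each data frame into one key string per row
key1 <- do.call(paste, df1[1:2])
key2 <- do.call(paste, df2[1:2])

# TRUE for the rows of df2 whose (id, id2) pair also occurs in df1
keep <- key2 %in% key1
print(df2[keep, ])
```

Only the rows c and d of df2 pass the filter, so row e (which has no counterpart in this df1) never reaches rbind().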
To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.
dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 e 5 NA
#6 c 3 200
#7 d 4 201
A possible approach with data.table, using an update join followed by a full outer merge:
merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)
output:
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
6: e 5 NA
data:
library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))
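The one-liner packs two steps together. As a sketch, it could be unrolled as follows; the explicit by argument is an assumption about what merge() uses by default here (all shared column names), and res is an illustrative name.

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d", "e"), id2 = c(1, 2, 3, 4, 5),
                  value = c(100, 101, 50, NA, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, NA))

# Step 1: update join. Only the rows of df1 whose value is NA are touched;
# .SD is that subset, and x.value is the matching value from df2.
# Note that := modifies df1 by reference.
df1[is.na(value), value := df2[.SD, on = .(id, id2), x.value]]

# Step 2: full outer merge on all three columns. Rows identical in both
# tables collapse into one; rows that differ in value (c) appear twice.
res <- merge(df1, df2, by = c("id", "id2", "value"), all = TRUE)
print(res)
```

Since step 1 changes df1 in place, work on copy(df1) if the original table is still needed afterwards.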
Here is one way using left_join and gather:
library(dplyr)
left_join(df1, df2, by = c("id","id2")) %>%
  tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
  select(-key)
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#7 c 3 200
#8 d 4 201
For the updated case, we can do
left_join(df1, df2, by = c("id","id2")) %>%
  tidyr::gather(key, value, starts_with("value")) %>%
  group_by(id, id2) %>%
  filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
  select(-key)
# id id2 value
# <chr> <int> <int>
#1 a 1 100
#2 b 2 101
#3 c 3 50
#4 e 5 NA
#5 c 3 200
#6 d 4 201
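In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a hedged sketch (not part of the original answer), the first snippet of this answer could be written as below, using the question's original data without row e; res is an illustrative name.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300), stringsAsFactors = FALSE)

res <- left_join(df1, df2, by = c("id", "id2")) %>%
  # Stack value.x and value.y into a single value column, dropping the NAs
  pivot_longer(starts_with("value"), names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  select(-key)

print(res)
```

Unlike gather(), pivot_longer() emits the stacked values row by row, so both c rows come out next to each other without an extra sort.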