Merging and filling the NA values of a column based on another dataframe
I have 2 dfs, a subset of which looks like this. Where available, I want to replace the "NA" values with the rsid values from the other df.
df1:
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C <NA>
1:2032101 C T rs5678
df2:
SNP A1 A2 rsid
2:107877 A G rs1112023
3:1000066 T C rs8213723
1:1000066 T C rs7778899
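This is what I want, with the NA replaced by the rsid value from the other df. In this example, the rsid in row 3 of df2 replaces the NA rsid in row 2 of df1. I want the new df to contain only the rows in df1, like this.
df3: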
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C rs7778899
1:2032101 C T rs5678
I tried this, but got some error messages. Can anyone help?
library(dplyr)
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
Error: Column `rsid` must be length 1 (a summary value), not 2
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
We can bind the datasets together with bind_rows, then group and summarise, removing the NA values with complete.cases (dplyr version >= 1.0)
library(dplyr)
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
-output
# A tibble: 5 x 4
# SNP A1 A2 rsid
# <chr> <chr> <chr> <chr>
#1 1:100000012 A G rs1234
#2 1:1000066 T C rs7778899
#3 1:2032101 C T rs5678
#4 2:107877 A G rs1112023
#5 3:1000066 T C rs8213723
If the dplyr version is < 1.0, summarise expects the output for each group to have length 1. We can wrap the result in a list and then unnest it (unnest comes from tidyr)
library(tidyr)  # provides unnest()
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = list(rsid[complete.cases(rsid)])) %>%
ungroup %>%
unnest(c(rsid))
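Based on the updated post, if we need to update the 'rsid' column from the second dataset, one option is to do a join and then assign (:=) after coalescing the two 'rsid' columns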
library(data.table)
setDT(df1)[df2, rsid := fcoalesce(rsid, i.rsid), on = .(SNP, A1, A2)]
-output
df1
# SNP A1 A2 rsid
#1: 1:100000012 A G rs1234
#2: 1:1000066 T C rs7778899
#3: 1:2032101 C T rs5678
A similar option is also available with dplyr
left_join(df1, df2, by = c('SNP', 'A1', 'A2')) %>%
transmute(SNP, A1, A2, rsid = coalesce(rsid.x, rsid.y))
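For readers without data.table or dplyr installed, the same join-then-coalesce idea can be sketched in base R. This is only an illustration under stated assumptions: the names key1, key2, idx and fill are made up for this example, the columns are assumed to be character rather than factor, and match() takes the first matching row if df2 contains duplicate keys. The example data is repeated so the snippet runs on its own.

```r
# Example data copied from this post (character columns, not factors)
df1 <- data.frame(
  SNP  = c("1:100000012", "1:1000066", "1:2032101"),
  A1   = c("A", "T", "C"),
  A2   = c("G", "C", "T"),
  rsid = c("rs1234", NA, "rs5678"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  SNP  = c("2:107877", "3:1000066", "1:1000066"),
  A1   = c("A", "T", "T"),
  A2   = c("G", "C", "C"),
  rsid = c("rs1112023", "rs8213723", "rs7778899"),
  stringsAsFactors = FALSE
)

# Build one composite key per row from the three join columns
# (key1/key2 are illustrative names, not part of the original answer)
key1 <- paste(df1$SNP, df1$A1, df1$A2)
key2 <- paste(df2$SNP, df2$A1, df2$A2)

# Position of each df1 row's first match in df2; NA when there is no match
idx <- match(key1, key2)

# Fill only the rows that are NA in df1 and have a match in df2,
# so existing rsid values and unmatched rows are left untouched
df3 <- df1
fill <- is.na(df3$rsid) & !is.na(idx)
df3$rsid[fill] <- df2$rsid[idx[fill]]

df3
```

Because df3 starts as a copy of df1, it keeps exactly the rows of df1 in their original order, which is what the question asks for.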
df1 <- structure(list(SNP = c("1:100000012", "1:1000066", "1:2032101"
), A1 = c("A", "T", "C"), A2 = c("G", "C", "T"), rsid = c("rs1234",
NA, "rs5678")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(SNP = c("2:107877", "3:1000066", "1:1000066"),
A1 = c("A", "T", "T"), A2 = c("G", "C", "C"), rsid = c("rs1112023",
"rs8213723", "rs7778899")), class = "data.frame", row.names = c(NA,
-3L))