How to extract one or more words from a string and search for them in two different columns from another file in R
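I need to find which rows of df1 are present in df2 by extracting the word(s) after "gene_id" in df1$Id and searching for them in two different columns of df2 (df2$Gene.id and df2$Gene.name).
This is what my data looks like: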
df1 <- tibble(  # tibble() replaces the deprecated data_frame()
Chr = c("NC_035077.1", "NC_035078.1", "NC_035083.1", "NC_035083.1", "NC_035084.1", "NC_035084.1", "NC_035088.1"),
Pos = c("61344375", "78462810", "24378412", "24387264","66360216", "66360385","40131947"),
Var=c("tco","born", "tco","tco", "born","tco","tco"),
Id=c("gene_id calm2", "gene_id LOC110500174", "gene_id ahcy", "gene_id ahcy", "gene_id cebpd", "gene_id cebpd", "gene_id LOC110537636, gene_id hsc70a")
)
df1
Chr Pos Var Id
<chr> <chr> <chr> <chr>
1 NC_035077.1 61344375 tco gene_id calm2
2 NC_035078.1 78462810 born gene_id LOC110500174
3 NC_035083.1 24378412 tco gene_id ahcy
4 NC_035083.1 24387264 tco gene_id ahcy
5 NC_035084.1 66360216 born gene_id cebpd
6 NC_035084.1 66360385 tco gene_id cebpd
7 NC_035088.1 40131947 tco gene_id LOC110537636, gene_id hsc70a
df2 <- tibble(  # tibble() replaces the deprecated data_frame()
Gene.id = c("LOC110488122", "NA", "LOC110490243", "LOC110537256", "LOC100136165", "LOC100379112", "LOC100379114", "LOC110527949", "LOC110537636"),
Gene.name = c("agr2", "agrn", "ahcy", "akap1","cebpb", "cebpb","cebpd", "ddost","slc6a13")
)
df2
Gene.id Gene.name
<chr> <chr>
1 LOC110488122 agr2
2 NA agrn
3 LOC110490243 ahcy
4 LOC110537256 akap1
5 LOC100136165 cebpb
6 LOC100379112 cebpb
7 LOC100379114 cebpd
8 LOC110527949 ddost
9 LOC110537636 slc6a13
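As you can see, some df1$Id values contain two gene_ids. I need to check both of them against df2, and if either one matches df2$Gene.id or df2$Gene.name, the row should be included in the output file.
There are also some NAs in df2.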
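One detail of the sample data is worth checking: the `"NA"` in `Gene.id` is written in quotes, so R stores it as an ordinary string rather than a missing value, even though the printed table does not show the difference. The sketch below uses a hypothetical two-row excerpt (`df2_demo` and `df2_clean` are illustrative names, not part of the original data) to show how `na_if()` from dplyr could convert it, assuming the real file has the same quirk.

```r
library(dplyr)

# Hypothetical two-row excerpt of df2, with the same column names as above
df2_demo <- tibble(
  Gene.id   = c("LOC110488122", "NA"),
  Gene.name = c("agr2", "agrn")
)

# The quoted "NA" is a plain string, so is.na() does not flag it
is.na(df2_demo$Gene.id)

# na_if() replaces the literal string "NA" with a real missing value
df2_clean <- df2_demo %>%
  mutate(Gene.id = na_if(Gene.id, "NA"))

is.na(df2_clean$Gene.id)
```

If the NAs in the real file are already true missing values, this step is unnecessary.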
My output should look like this:
Chr Pos Var Id Gene.id Gene.name
NC_035083.1 24378412 tco gene_id ahcy LOC110490243 ahcy
NC_035083.1 24387264 tco gene_id ahcy LOC110490243 ahcy
NC_035084.1 66360216 born gene_id cebpd LOC100379114 cebpd
NC_035084.1 66360385 tco gene_id cebpd LOC100379114 cebpd
NC_035088.1 40131947 tco gene_id LOC110537636, gene_id hsc70a LOC110537636 slc6a13
Any help on how to achieve this would be greatly appreciated.
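This does the trick with your example. I am assuming that all of your data follows the same format.
You need to create a clean "all in" column to join df1 with df2. After the two joins, drop the rows that matched nothing and use the "all in" column to fill in the remaining NA values. This works because a row matched by name gets its Gene.id from the first join and a row matched by id gets its Gene.name from the second, so the column that is still NA is exactly the value the "all in" column already holds.
Finally, remove new_col, since you no longer need it.
library(stringr) # for str_remove()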
library(tidyr) # for separate_rows()
library(dplyr) # for everything else
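df1 %>%
  mutate(new_col = Id) %>%                               # work on a copy so Id stays intact
  separate_rows(new_col, sep = ", ") %>%                 # one row per gene_id entry
  mutate(new_col = str_remove(new_col, "gene_id ")) %>%  # keep only the identifier
  left_join(df2, by = c("new_col" = "Gene.name")) %>%    # match on name: brings in Gene.id
  left_join(df2, by = c("new_col" = "Gene.id")) %>%      # match on id: brings in Gene.name
  filter(!is.na(Gene.name) | !is.na(Gene.id)) %>%        # drop entries that matched nothing
  mutate(Gene.name = if_else(is.na(Gene.name), new_col, Gene.name),  # fill in the matched key
         Gene.id = if_else(is.na(Gene.id), new_col, Gene.id)) %>%
  select(-new_col)                                       # helper column no longer needed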
Chr Pos Var Id Gene.id Gene.name
<chr> <chr> <chr> <chr> <chr> <chr>
1 NC_035083.1 24378412 tco gene_id ahcy LOC110490243 ahcy
2 NC_035083.1 24387264 tco gene_id ahcy LOC110490243 ahcy
3 NC_035084.1 66360216 born gene_id cebpd LOC100379114 cebpd
4 NC_035084.1 66360385 tco gene_id cebpd LOC100379114 cebpd
5 NC_035088.1 40131947 tco gene_id LOC110537636, gene_id hsc70a LOC110537636 slc6a13
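For comparison, here is a sketch of a different way to get the same rows with a single join. It is not the approach above: it first stacks `Gene.id` and `Gene.name` into one lookup column and then joins once. The names `lookup`, `key`, `key_type`, `id_key`, `name_key` and `result` are made up for illustration, `tibble()` stands in for the deprecated `data_frame()`, and the `na_if()` step assumes the NAs in df2 are stored as the string "NA", as in the sample data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data copied from the question
df1 <- tibble(
  Chr = c("NC_035077.1", "NC_035078.1", "NC_035083.1", "NC_035083.1",
          "NC_035084.1", "NC_035084.1", "NC_035088.1"),
  Pos = c("61344375", "78462810", "24378412", "24387264",
          "66360216", "66360385", "40131947"),
  Var = c("tco", "born", "tco", "tco", "born", "tco", "tco"),
  Id  = c("gene_id calm2", "gene_id LOC110500174", "gene_id ahcy",
          "gene_id ahcy", "gene_id cebpd", "gene_id cebpd",
          "gene_id LOC110537636, gene_id hsc70a")
)
df2 <- tibble(
  Gene.id   = c("LOC110488122", "NA", "LOC110490243", "LOC110537256",
                "LOC100136165", "LOC100379112", "LOC100379114",
                "LOC110527949", "LOC110537636"),
  Gene.name = c("agr2", "agrn", "ahcy", "akap1", "cebpb", "cebpb",
                "cebpd", "ddost", "slc6a13")
)

# Build a lookup table with one row per searchable value (id or name),
# keeping the original Gene.id and Gene.name next to it
lookup <- df2 %>%
  mutate(Gene.id = na_if(Gene.id, "NA")) %>%         # assumption: "NA" strings mean missing
  mutate(id_key = Gene.id, name_key = Gene.name) %>%
  pivot_longer(c(id_key, name_key),
               names_to = "key_type", values_to = "key",
               values_drop_na = TRUE) %>%            # a missing id cannot be searched for
  select(-key_type) %>%
  distinct()

# Split Id into one row per identifier, then keep only the rows found in lookup
result <- df1 %>%
  mutate(key = str_remove_all(Id, "gene_id ")) %>%
  separate_rows(key, sep = ",\\s*") %>%
  inner_join(lookup, by = "key") %>%
  select(-key) %>%
  distinct()   # collapses repeats when two identifiers in one Id hit the same df2 row

result
```

On the sample data this should return the same five rows as above. One trade-off to keep in mind: the final `distinct()` also collapses rows of df1 that are exact duplicates of each other, which may or may not be what you want.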