R: replace words in a column using an annotation data frame

Question

I have a data frame (data1) with one column of gene IDs. In another data frame (data2) I have the corresponding gene names. data1 also contains cells with multiple gene IDs, separated by ':', and a lot of NAs. Preferably I want to add a column to data1 with the corresponding gene names, also separated by ':' if there are multiple. An alternative would be to replace all the gene IDs in data1 with the corresponding gene names. Any idea how to go about this? Thanks!

a <- c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA")
data1 <- data.frame(a)


b <- c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294")
c <- c("GeneA", "GeneB", "GeneC")
data2 <- data.frame(b,c)

Answer 1

One option involving stringr could be:

library(stringr)
data1$res <- str_replace_all(data1$a, setNames(data2$c, data2$b))

                                a         res
1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
2                 ENSG00000185294       GeneC
3                              NA          NA
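
Note that str_replace_all() treats the names of the vector as regular expressions, so an ID that happens to be a prefix of another ID could be replaced inside the longer one. With fixed-length Ensembl IDs this should not occur, but a minimal sketch of a more defensive variant, assuming the same example data, wraps each ID in word boundaries. The name `lookup` is only illustrative and is not part of the answer above.

```r
library(stringr)

# Example data from the question ("NA" is a literal string here)
data1 <- data.frame(a = c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(b = c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294"),
                    c = c("GeneA", "GeneB", "GeneC"),
                    stringsAsFactors = FALSE)

# Names are the regex patterns, values are the replacements.
# \b on both sides means only whole IDs are matched (an added safeguard).
lookup <- setNames(as.character(data2$c), paste0("\\b", data2$b, "\\b"))

data1$res <- str_replace_all(data1$a, lookup)
data1
```

IDs without an entry in data2 are left unchanged by this approach, which is also why the literal "NA" string passes through as is.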

Answer 2

We can get data1 in long format, left_join data2 and paste the values back together.

library(dplyr)

data1 %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(a, sep = ":") %>%
  left_join(data2, by = c('a' = 'b')) %>%
  group_by(row) %>%
  summarise(a = paste0(a, collapse = ":"), 
            c = paste0(c, collapse = ":")) %>%
  select(-row)

#  a                               c          
#  <chr>                           <chr>      
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294                 GeneC      
#3 NA                              NA

Answer 3

Here is another option with gsubfn:

library(gsubfn)
data1$res <- gsubfn("\\w+", setNames(as.list(as.character(data2$c)), 
             data2$b), as.character(data1$a))
data1
#                                a         res
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2                 ENSG00000185294       GeneC
#3                              NA          NA

In base R, this can also be done by splitting the 'a' column with strsplit and then looking up each ID in a named vector created from the 'b' and 'c' columns of the second dataset:

is.na(data1$a) <- data1$a == "NA" # converting to real NA instead of character
i1 <- !is.na(data1$a)
# create named vector
v1 <- setNames(as.character(data2$c), data2$b)
data1$res[i1] <- sapply(strsplit(as.character(data1$a[i1]), ":"), 
           function(x) paste(v1[x], collapse=":"))
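
The base R steps above can also be wrapped in a small function. The following is only a sketch under stated assumptions: `map_ids()` is a hypothetical helper name, it uses match() instead of name-based indexing, and it keeps IDs that have no entry in data2 unchanged rather than turning them into NA (a design choice, not something the question specifies).

```r
# Example data from the question ("NA" is a literal string here)
data1 <- data.frame(a = c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(b = c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294"),
                    c = c("GeneA", "GeneB", "GeneC"),
                    stringsAsFactors = FALSE)

# map_ids() is a hypothetical helper, not part of any package.
map_ids <- function(ids, from, to, sep = ":") {
  ids  <- as.character(ids)
  from <- as.character(from)
  to   <- as.character(to)
  ids[ids %in% "NA"] <- NA_character_    # treat the literal string "NA" as missing
  out <- rep(NA_character_, length(ids))
  ok  <- !is.na(ids)
  out[ok] <- vapply(strsplit(ids[ok], sep, fixed = TRUE), function(x) {
    hit <- to[match(x, from)]            # NA where an ID has no annotation
    hit[is.na(hit)] <- x[is.na(hit)]     # assumption: keep unmapped IDs unchanged
    paste(hit, collapse = sep)
  }, character(1))
  out
}

data1$res <- map_ids(data1$a, data2$b, data2$c)
data1
```

Using vapply() with character(1) instead of sapply() guarantees a character vector even when data1 has zero non-missing rows.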

3 answers

Answer 1: 3 votes, accepted, 2020-02-15 09:33:57
Answer 2: 0 votes, 2020-02-15 14:35:12
Answer 3: 0 votes, 2020-02-15 17:52:43
