Extracting gene annotation IDs in R

Question

I have an annotation file and I want to parse out the FlyBase transcript IDs to make a new column. I've tried regex, but it hasn't worked, and I'm not sure I'm using it correctly. The IDs are either at the beginning or in the middle of the string, which in this case is a collection of IDs from different databases. There may also be multiple FlyBase IDs, in which case I'd like to join them with a separator, like ID1/ID2.

Example annotation lines:

"AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"

"FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"

I want to create a column that keeps the same order but contains only the FlyBase IDs, with separators where necessary. I am working with the data.table package, so a solution using data tables would be much appreciated. One idea I have is to use sub, search for [ FBtr][0-9+] (not sure if that's right), and replace anything that doesn't match that pattern with "".

Example table:

x <- data.table(probesetID = 1:10, probesetType = rep("main", 10), rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))

Answer 1

Here is something to get you started; I can update the answer once I have a better idea of what your data.table looks like:

x <- "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"
sapply(strsplit(x, "/+"), function(s) grep("FBtr", trimws(s), value=TRUE))

#     [,1]         
#[1,] "FBtr0079338"
#[2,] "FBtr0086326"
#[3,] "FBtr0100846"

sapply(strsplit(x, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))
#[1] "FBtr0079338;FBtr0086326;FBtr0100846"
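
As a complement to the split-and-grep approach above, here is a minimal base R sketch that extracts the IDs directly with a regular expression and joins them with the "/" separator the question asks for. It assumes every FlyBase transcript ID is the literal prefix FBtr followed by one or more digits. Note that the pattern [ FBtr][0-9+] from the question is a character class, so it matches a single character (a space, F, B, t, or r) followed by a single digit or plus sign, whereas FBtr[0-9]+ matches the whole ID. The variable names ann, hits, and fb_ids are illustrative, and the second string is a shortened version of the question's example.

```r
# Two annotation strings based on the question (the second one is shortened).
ann <- c(
  "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0",
  "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0"
)

# gregexpr() locates every match of the pattern in each string;
# regmatches() returns the matched substrings as a list of character vectors.
hits <- regmatches(ann, gregexpr("FBtr[0-9]+", ann))

# Collapse each vector into one "/"-separated string.
# A string with no FlyBase ID yields "".
fb_ids <- vapply(hits, paste, character(1), collapse = "/")
fb_ids
#[1] "FBtr0089787"             "FBtr0079338/FBtr0086326"
```

Because the matches are returned left to right, the original order of the IDs within each string is preserved.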

Edit:

To assign the result to a new column in the data.table:

x$FBtr <- sapply(strsplit(x$V3, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))

In essence, you supply the column containing the annotations in place of x.
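
The same assignment can also be written with data.table's := operator, which adds the column by reference. The sketch below rebuilds the example table from the question (data.table auto-names the unnamed third column V3) and uses "/" as the separator instead of ";". The ^ anchor in the grep pattern is an added assumption that each ID starts its field after trimming.

```r
library(data.table)

# Example table from the question; the unnamed third column becomes V3.
x <- data.table(
  probesetID   = 1:10,
  probesetType = rep("main", 10),
  rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10)
)

# Split each annotation on runs of "/", trim whitespace, keep the fields
# that start with "FBtr", and join them with "/". := adds the column in place.
x[, FBtr := sapply(strsplit(V3, "/+"), function(s) {
  paste(grep("^FBtr", trimws(s), value = TRUE), collapse = "/")
})]

x$FBtr[1]
#[1] "FBtr0299871/FBtr193920"
```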

Answer 2

A solution more specific to data.table, using the stringr package:

library(stringr)
x[, .(IDs = str_c(unlist(str_extract_all(V3, "(FBtr)[0-9]+")), 
    collapse = "/")), by = probesetID]
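
One thing to keep in mind: grouping with by = probesetID merges the IDs of all rows that share a probesetID, so the call above assumes each probesetID appears once. A possible row-wise variant, sketched below, relies on str_extract_all returning one character vector per input string, so no grouping is needed and the result can be added as a column with :=. The table and the IDs in it are made up for illustration.

```r
library(data.table)
library(stringr)

# Hypothetical table in which probesetID 1 appears twice.
x <- data.table(
  probesetID = c(1L, 1L, 2L),
  V3 = c(
    "FBtr0000001 // --- // 100",
    "AY113634 // --- // 100",
    "FBtr0000002 // --- /// FBtr0000003 // ---"
  )
)

# str_extract_all() returns a list with one vector of matches per row;
# each vector is collapsed into a single "/"-separated string.
x[, IDs := vapply(str_extract_all(V3, "FBtr[0-9]+"), paste,
                  character(1), collapse = "/")]

x$IDs
#[1] "FBtr0000001"             ""                        "FBtr0000002/FBtr0000003"
```

Rows without any FlyBase ID end up with an empty string rather than being dropped, which keeps the new column aligned with the original rows.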

Answer 1 was posted 2017-10-12 17:21:30. Answer 2 (accepted) was posted 2017-10-12 18:35:35.
