
Extracting Gene Annotation IDs in R

I have an annotation file and I want to parse out FlyBase transcript IDs to make a new column. I've tried regex, but it hasn't worked, and I'm not sure whether I'm using it correctly. The IDs are either at the beginning or in the middle of the string, which in this case is a collection of IDs from different databases. There might also be multiple FlyBase IDs, in which case I'd like to use a separator like ID1/ID2.

Example annotation lines:

"AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"

"FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"

I want to create a column that maintains the same order but contains only the FlyBase IDs, with separators if necessary. I am working with the data.table package, so a solution using data tables would be much appreciated. One idea I have is to use sub, search for [ FBtr][0-9+] (not sure if that's right), and replace anything that doesn't match that pattern with "".

Example table:

x <- data.table(probesetID = 1:10, probesetType = rep("main", 10), rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))

Here is something to get you started. I can update the answer once I have a better idea of what your data.table looks like:

x <- "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"
sapply(strsplit(x, "/+"), function(s) grep("FBtr", trimws(s), value=TRUE))

#     [,1]         
#[1,] "FBtr0079338"
#[2,] "FBtr0086326"
#[3,] "FBtr0100846"

sapply(strsplit(x, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))
#[1] "FBtr0079338;FBtr0086326;FBtr0100846"
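On the pattern floated in the question: as far as I can tell, [ FBtr][0-9+] is read as two character classes (one character out of space, F, B, t, r, followed by one character that is a digit or a plus sign), so it does not match a whole ID. A minimal sketch of the difference, using base R only and one of the example lines from the question; the variable names are just for illustration:

```r
s <- "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"

# Character-class version from the question: each match is only two characters long.
bad <- regmatches(s, gregexpr("[ FBtr][0-9+]", s))[[1]]

# Literal prefix "FBtr" followed by one or more digits: matches whole IDs.
good <- regmatches(s, gregexpr("FBtr[0-9]+", s))[[1]]

print(bad)
print(good)
```

Moving the + outside the brackets is what turns "a digit or a plus" into "one or more digits".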

Edit:

To assign the result to a new column in the data.table:

x$FBtr <- sapply(strsplit(x$V3, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))

In essence, you can supply the column containing the annotations in place of x.

More specific to data.table, and using the stringr package:

library(stringr)
x[, .(IDs = str_c(unlist(str_extract_all(V3, "(FBtr)[0-9]+")), 
    collapse = "/")), by = probesetID]
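
If you would rather avoid the extra dependency and the per-row grouping, the same extraction can be sketched in base R. This is only an illustration: extract_flybase_ids is a hypothetical helper name, not part of any package, and it assumes every FlyBase transcript ID looks like FBtr followed by digits.

```r
# Hypothetical helper: returns one string per input element, with all
# FBtr IDs in their original order joined by `sep` ("" when none are found).
extract_flybase_ids <- function(annotation, sep = "/") {
  # gregexpr() locates every match; regmatches() returns them as a list,
  # one character vector per input string.
  ids <- regmatches(annotation, gregexpr("FBtr[0-9]+", annotation))
  vapply(ids, paste, character(1), collapse = sep)
}

ann <- c(
  "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0",
  "FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0",
  "NONDMET000145 // --- // 100 // 15 // 15 // 0"
)
res <- extract_flybase_ids(ann)
print(res)
```

Because the helper is vectorised, it should also work inside a data.table assignment such as x[, FBtr := extract_flybase_ids(V3)], assuming the annotation column is named V3 as in the example table.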
