
grep substring following a symbol

I would like to make a new column that contains the string following the last ; symbol in the column ID. I know how to do this using awk, but not in R.

> head(Mapped2)
                                              IsomiR                                                               ID
1                  TCCCGGGTGGTCTAGTGGTTAGGATTCGGCGCT                                   URS0000635088;tRNA-Glu-CTC-2-1
2                  TCCCGGGTGGTCTAGTGGTTAGGATTCGGCGCT                                           URS000011CFE8;misc_RNA
3                  TCCCGGGTGGTCTAGTGGTTAGGATTCGGCGCT                                  URS00006A26A3;Homo;sapiens;tRNA
4 TTGCCCTCGGCCGATCGAAAGGGAGTCGGGTTCAGATCCCCGAATCCGGA                    URS00008D20CE;Homo;sapiens;large;subunit;rRNA
5 TTGCCCTCGGCCGATCGAAAGGGAGTCGGGTTCAGATCCCCGAATCCGGA                    URS00008C7E99;Homo;sapiens;large;subunit;rRNA
6 TTGCCCTCGGCCGATCGAAAGGGAGTCGGGTTCAGATCCCCGAATCCGGA URS000075EC78;Homo;sapiens;RNA,;28S;ribosomal;5;(RNA28S5),;rRNA.

How about a pattern that captures the non-; characters between the last ; and the end of the string, like this:

s <- "6TTGCCCTCGGCCGATCGAAAGGGAGTCGGGTTCAGATCCCCGAATCCGGAURS000075EC78;Homo;sapiens;RNA,;28S;ribosomal;5;(RNA28S5),;rRNA."
gsub(".*;([^;]+)$", "\\1", s)
# [1] "rRNA."

Working example:

d <- structure(list(ID = structure(c(2L, 1L, 3L, 6L, 5L, 4L), .Label = c("URS000011CFE8;misc_RNA", "URS0000635088;tRNA-Glu-CTC-2-1", "URS00006A26A3;Homo;sapiens;tRNA", "URS000075EC78;Homo;sapiens;RNA,;28S;ribosomal;5;(RNA28S5),;rRNA.", "URS00008C7E99;Homo;sapiens;large;subunit;rRNA", "URS00008D20CE;Homo;sapiens;large;subunit;rRNA"), class = "factor")), .Names = "ID", class = "data.frame", row.names = c(NA, -6L))

d$newcol <- gsub(".*;([^;]+)$", "\\1", d$ID)

d
#                                                                 ID           newcol
# 1                                   URS0000635088;tRNA-Glu-CTC-2-1 tRNA-Glu-CTC-2-1
# 2                                           URS000011CFE8;misc_RNA         misc_RNA
# 3                                  URS00006A26A3;Homo;sapiens;tRNA             tRNA
# 4                    URS00008D20CE;Homo;sapiens;large;subunit;rRNA             rRNA
# 5                    URS00008C7E99;Homo;sapiens;large;subunit;rRNA             rRNA
# 6 URS000075EC78;Homo;sapiens;RNA,;28S;ribosomal;5;(RNA28S5),;rRNA.            rRNA.
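
One thing to be aware of with this approach: when the pattern does not match (for example, an ID with no ; in it), sub() and gsub() return the input unchanged rather than a missing value. A minimal sketch of one way to flag those rows, assuming you would prefer NA there; the ids vector and the "no_separator" entry are made up for illustration:

```r
# Same pattern as above; "no_separator" is a hypothetical ID with no ";" in it.
pattern <- ".*;([^;]+)$"
ids <- c("URS000011CFE8;misc_RNA",
         "URS00006A26A3;Homo;sapiens;tRNA",
         "no_separator")

# grepl() tells us which elements the pattern actually matches.
hit <- grepl(pattern, ids)

# Keep the captured group where there is a match, NA otherwise.
newcol <- ifelse(hit, sub(pattern, "\\1", ids), NA_character_)
newcol
# [1] "misc_RNA" "tRNA"     NA
```

Whether NA or the untouched string is the better fallback depends on what the downstream analysis expects.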

If you want everything after the last occurrence of ;, you can use a greedy match to consume everything up to and including that ; and replace it with an empty string, leaving only what follows, e.g.

sub(".*;" , "", Mapped2$ID)
# [1] "tRNA-Glu-CTC-2-1" "misc_RNA" "tRNA" "rRNA" "rRNA" "rRNA."          
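
If you would rather avoid regular expressions altogether, splitting on the separator and taking the last piece gives the same result. This is only a sketch of an alternative, not part of the answer above; it assumes every ID is a non-empty string, and the ids vector is a small sample copied from the question:

```r
ids <- c("URS0000635088;tRNA-Glu-CTC-2-1",
         "URS00006A26A3;Homo;sapiens;tRNA",
         "URS000075EC78;Homo;sapiens;RNA,;28S;ribosomal;5;(RNA28S5),;rRNA.")

# fixed = TRUE treats ";" as a literal string rather than a regex.
# Wrap the column in as.character() first if it is a factor.
parts <- strsplit(as.character(ids), ";", fixed = TRUE)

# Take the last element of each split; vapply() guarantees a character vector.
last_piece <- vapply(parts, function(p) p[length(p)], character(1))
last_piece
# [1] "tRNA-Glu-CTC-2-1" "tRNA"             "rRNA."
```

The sub() one-liner is shorter; the split version may be easier to adapt if you later need a different field, such as the first or second one.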

Given that grep uses regular expressions, here's a regex that works for me: /;([^\\;]*)\\n/g

See this regex demo for an implementation.

I don't know R, unfortunately, but hopefully that can get you started using grep to that end.
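
For what it is worth, the same idea can be sketched in R. Each ID is its own string there, so there is no trailing newline to anchor on; the end-of-string anchor $ plays that role instead. This is an adaptation under that assumption, not a literal translation of the regex above, and ids is a small sample taken from the question:

```r
ids <- c("URS0000635088;tRNA-Glu-CTC-2-1",
         "URS000011CFE8;misc_RNA")

# regexpr() finds the run of non-";" characters that ends the string;
# regmatches() extracts the matched text.
m <- regexpr("[^;]+$", ids)
regmatches(ids, m)
# [1] "tRNA-Glu-CTC-2-1" "misc_RNA"
```

Note that regmatches() silently drops elements with no match (for instance an ID ending in ;), so the result can be shorter than the input. That makes it less convenient than sub() for filling a data frame column directly.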
