
How to remove part of string using R regex with boundary

I have these 3 example strings:

x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"
y <- "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found"
z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

What I want to do is remove everything that comes after the first word following the last / delimiter, resulting in:

AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
SPIB/MA0081.1/Jaspar

I tried this but it doesn't give what I want:

> sub("\\(.*?\\)More Information | Similar Motifs Found","",x)
[1] "AP-1| Similar Motifs Found"

What's the right way to do it?
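
The attempt fails for two reasons, which the sketch below tries to illustrate (the names unescaped, escaped and narrowed are only illustrative). An unescaped | is alternation, so the pattern matches either "\\(.*?\\)More Information " or " Similar Motifs Found", and sub() replaces only the first match. Even with the | escaped, the match starts at the first ( in the string, so everything from (bZIP) onward is removed. Restricting the parenthesized part to [^()]* is one possible way to narrow it, assuming the trailing text always looks exactly like it does in the examples.

```r
x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"

# Unescaped "|" splits the pattern into two alternatives;
# only the first match (starting at "(bZIP)") is replaced.
unescaped <- sub("\\(.*?\\)More Information | Similar Motifs Found", "", x)

# Escaping "|" makes it a literal pipe, but the match still
# begins at the first "(" in the string.
escaped <- sub("\\(.*?\\)More Information \\| Similar Motifs Found", "", x)

# [^()]* cannot cross a parenthesis, so only the last "(...)" can match.
narrowed <- sub("\\([^()]*\\)More Information \\| Similar Motifs Found", "", x)

print(unescaped)
print(escaped)
print(narrowed)
```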

You can use a greedy pattern, (.*/\\w+).*, to match up to the last /word, then extract the group with a backreference:

v <- c("AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found", "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found", "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found")

sub("(.*/\\w+).*", "\\1", v)
# [1] "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer"          "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer"
# [3] "SPIB/MA0081.1/Jaspar" 

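In (.*/\\w+).*, the first .* is greedy and matches as much as possible, then backtracks to the last / that is followed by a word (matched by \\w+, which stops at the first non-word character such as the opening parenthesis). The second .* matches the rest of the string, so replacing the whole match with \\1 keeps only the captured part.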
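
To inspect what the group actually captures, a minimal sketch with base R's regexec() and regmatches() may help; pattern, m and kept are illustrative names. Note that \\w+ only covers letters, digits and underscores, so this relies on the name after the last / (Homer, Jaspar) containing no other characters.

```r
v <- c(
  "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found",
  "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"
)
pattern <- "(.*/\\w+).*"

# regexec() reports the full match followed by each capture group
m <- regmatches(v, regexec(pattern, v))

# Element 1 is the whole match; element 2 is capture group 1,
# i.e. everything up to and including the last "/word"
kept <- vapply(m, function(g) g[2], character(1))
print(kept)
```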
