How to remove part of string using R regex with boundary

Question

I have these 3 example strings:

x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"
y <- "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found"
z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

What I want to do is remove everything that comes after the first word following the last / delimiter, resulting in:

AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
SPIB/MA0081.1/Jaspar

I tried this but it doesn't give what I want:

> sub("\\(.*?\\)More Information | Similar Motifs Found","",x)
[1] "AP-1| Similar Motifs Found"

What's the right way to do it?

Answer 1

You can use a greedy pattern, (.*/\\w+).*, to match up to the last /word, then extract the captured group with a back reference:

v <- c("AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found", "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found", "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found")

sub("(.*/\\w+).*", "\\1", v)
# [1] "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer"          "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer"
# [3] "SPIB/MA0081.1/Jaspar"

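In (.*/\\w+).*, the first .* is greedy and matches as many characters as possible; the engine then backtracks until what follows is a / plus a word (matched by \\w+), which is therefore the last /word in the string. The second .* matches the rest of the string, so sub() replaces the whole string with the captured group alone.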
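To make the greedy behaviour concrete, here is a minimal sketch that reuses the strings from the question. The variable names greedy, extracted and lazy are only for illustration, and the regmatches() variant is an alternative that is not part of the original answer. The lazy version is included purely as a contrast, to show why the first .* has to be greedy.

```r
v <- c(
  "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found",
  "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found",
  "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"
)

# Greedy: the first .* runs to the end of the string and backtracks
# to the LAST "/" that is followed by word characters.
greedy <- sub("(.*/\\w+).*", "\\1", v)

# Alternative without a back reference: extract the match directly.
# Note that regmatches() drops elements that do not match,
# whereas sub() returns them unchanged.
extracted <- regmatches(v, regexpr(".*/\\w+", v))

# Contrast: a lazy .*? stops at the FIRST "/" instead of the last one.
lazy <- sub("(.*?/\\w+).*", "\\1", v[1], perl = TRUE)

print(greedy)
print(extracted)
print(lazy)  # "AP-1(bZIP)/ThioMac"
```

If some inputs might not contain a /, the sub() form is probably the safer choice, since it keeps the result the same length as the input vector.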

solution1
1 ACCEPTED 2017-11-17 01:19:37
