
Match from last occurrence using regex in perl

I have a text like this:

hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of 
/* any string */ select this part on
ly 
////RESULT rest string

The text spans multiple lines, and I need to extract everything from the last occurrence of "*/" up to "////RESULT". In this case, the result should be:

 select this part on
ly 

How can I achieve this in Perl?

I have attempted \*/(.|\n)*////RESULT but that starts matching from the first "*/".

A useful trick in cases like this is to prefix the regexp with the greedy pattern .* , which will try to match as many characters as possible before the rest of the pattern matches. So:

my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);

Let's break this pattern into its components:

  • ^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.

  • \*/ just matches the literal string */.

  • (.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.

  • Finally, ////RESULT just matches itself.

Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they don't collide with any normal regexp syntax.
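To see the greedy-prefix trick run end to end, here is a minimal sketch that applies the pattern above to the sample text from the question. The sample is built from explicit string pieces only so that the trailing spaces survive copy and paste; nothing else is assumed.

```perl
use strict;
use warnings;

# Sample text from the question, spelled out line by line.
my $string =
      "hello world /* select a from table_b\n"
    . "*/ some other text with new line cha\n"
    . "racter and there are some blocks of \n"
    . "/* any string */ select this part on\n"
    . "ly \n"
    . "////RESULT rest string\n";

# The greedy ^.* first runs to the end of the string, then backtracks
# until the rest of the pattern can match, so \*/ lands on the last
# "*/" that is still followed by ////RESULT.
my ($match) = $string =~ m!^.*\*/(.*?)////RESULT!s;

print defined $match ? "[$match]\n" : "no match\n";
```

The brackets in the output only make the leading space and the trailing newline of the captured segment visible.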


Edit: Per discussion with ikegami below, I guess I should note that, if you want to use this regexp as a sub-pattern in a longer regexp, and if you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>) subexpression, like this:

my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my ($match) = ($string =~ /^.*$regexp$some_other_regexp/s);

The (?>) causes the pattern inside it to fail rather than accept a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT), even if that means that the rest of the regexp will fail to match.
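Here is a small sketch of the difference the independent subexpression makes. The input string and the $tail pattern (standing in for $some_other_regexp) are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical input: two ////RESULT markers, and the trailing pattern
# can only match after the second one.
my $string = "x */ first ////RESULT second ////RESULT END";
my $tail   = qr/ END/;    # hypothetical stand-in for $some_other_regexp

my $plain  = qr!\*/(.*?)////RESULT!s;
my $atomic = qr!\*/(?>(.*?)////RESULT)!s;

# Without (?>), the engine may stretch (.*?) past the first ////RESULT
# so that $tail can match.
my ($loose) = $string =~ /^.*$plain$tail/s;

# With (?>), the group keeps its first match, so the whole match fails
# once $tail cannot match right after the first ////RESULT.
my ($guarded) = $string =~ /^.*$atomic$tail/s;

print "plain:  ", defined $loose   ? "[$loose]"   : "no match", "\n";
print "atomic: ", defined $guarded ? "[$guarded]" : "no match", "\n";
```

With this input, the plain version captures " first ////RESULT second " and the atomic version does not match at all, which is the trade-off described above.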

(?:(?!STRING).)*

matches any number of characters that don't contain STRING. It's like [^a]*, but for strings instead of characters.
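A quick sketch of that construct next to its single-character counterpart. The inputs are arbitrary examples, and STRING is just the placeholder used above.

```perl
use strict;
use warnings;

# "STRING" is only the placeholder from the text above.
my $input = "aaSTRINGbb";

# Consume one character at a time, but only while STRING does not
# start at the current position.
my ($before) = $input =~ /^((?:(?!STRING).)*)/s;

# Single-character analogue: consume characters that are not "a".
my ($prefix) = "xxayy" =~ /^([^a]*)/;

print "$before\n";    # aa
print "$prefix\n";    # xx
```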

You can take shortcuts if you know certain inputs won't be encountered (like Kenosis and Ilmari Karonen did), but this is what matches what you specified:

my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
    (?: (?! \*/ ). )*
    \z
}xs;

If you don't care if */ appears after ////RESULT, the following is the safest:

my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;
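To illustrate how the two patterns above differ, here is a sketch with a made-up input in which a stray */ follows the ////RESULT marker. The variable names are only for this example.

```perl
use strict;
use warnings;

# Hypothetical input with a "*/" after the ////RESULT marker.
my $string = "a */ x ////RESULT y */ z";

# First pattern: no "*/" may appear anywhere after ////RESULT.
my ($strict_match) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
    (?: (?! \*/ ). )*
    \z
}xs;

# Second pattern: whatever follows ////RESULT is ignored.
my ($relaxed_match) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;

print "strict:  ", defined $strict_match  ? "[$strict_match]"  : "no match", "\n";
print "relaxed: ", defined $relaxed_match ? "[$relaxed_match]" : "no match", "\n";
```

For this input the first pattern finds nothing, because the */ that precedes ////RESULT is not the last one in the string, while the second pattern captures " x ".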

You didn't specify what should happen if there are two ////RESULT markers that follow the last */. The above matches until the last one. If you wanted to match until the first one, you'd use

my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ | ////RESULT ). )* )
    ////RESULT
}xs;
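A sketch comparing the "until the last" and "until the first" variants on a made-up input with two ////RESULT markers after the last */. The input and variable names are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical input: two comments, then two ////RESULT markers.
my $string = "/* c1 */ skip /* c2 */ keep ////RESULT more ////RESULT tail";

# Greedy capture that only avoids "*/": runs up to the last ////RESULT.
my ($to_last) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;

# Capture that also refuses to cross ////RESULT: stops at the first one.
my ($to_first) = $string =~ m{
    \*/
    ( (?: (?! \*/ | ////RESULT ). )* )
    ////RESULT
}xs;

print "to last:  [$to_last]\n";
print "to first: [$to_first]\n";
```

Here the first variant captures " keep ////RESULT more " and the second captures " keep ".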

Here's one option:

use strict;
use warnings;

my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of 
/* any string */ select this part on
ly 
////RESULT
END

my ($segment) = $string =~ m!\*/([^/]+)////RESULT$!s;

print $segment;

Output:

 select this part on
ly 
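One thing to keep in mind with the [^/]+ shortcut is that it assumes the wanted segment never contains a slash. A small sketch with a made-up input shows what happens when it does, next to the lookahead-based pattern from the earlier answer:

```perl
use strict;
use warnings;

# Hypothetical input whose wanted segment contains a "/".
my $with_slash = "/* note */ price is 3/4\n////RESULT\n";

# [^/]+ cannot cross the slash in "3/4", so this finds no match.
my ($shortcut) = $with_slash =~ m!\*/([^/]+)////RESULT$!s;

# The lookahead-based pattern only refuses to cross "*/".
my ($general) = $with_slash =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;

print "shortcut: ", defined $shortcut ? "[$shortcut]" : "no match", "\n";
print "general:  ", defined $general  ? "[$general]"  : "no match", "\n";
```

If slashes can't occur in your data, the shorter pattern is fine.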


 