
Perl regex capture groups and nth occurrence

I am learning Perl regex and am trying to combine capture groups with matching the nth occurrence of a string.

Say I have the following:

title="alpha" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

I want to change the title attribute to the string that follows the nth name=, e.g. sigma, while keeping all the content in between. The value after name= may or may not be wrapped in double quotes, as in name="beta" or name=sigma.

1st occurrence of name=:

title="beta" lorem ipsum lorem ipsum Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

2nd occurrence of name=:

title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur

I use:

find . -type f -exec perl -pi -w -e 's/(title=)"?[^"\s]*"?(.*) name="?([^"\/]+)"?/$1"$3"$2/' \{\} \;

This works for the first occurrence of name=.

I cannot figure out how to modify this to target the nth occurrence of name=. I know the basics of targeting the nth occurrence (such as replacing the second abc with xyz), ...

s/abc/ ++$count == 2 ? "xyz" : "abc" /eg

... but I have trouble integrating this into my code above. How do I specify the nth name= and move the capture group that follows it into the title attribute?

You can set the {n} quantifier in the pattern by hand so that it skips n preceding key=value pairs before reaching the one you are interested in.

(title=)"?[^\s="]+"?(\h+(?:.*?[^\s=]+=[^\s=]+){0}.*?)[^\s=]+="?([^\s="]+)"?\h*
                                              ^^^

The pattern matches:

  • (title=)"?[^\s="]+"? Capture group 1 matches title= ; the rest matches the old value, which is dropped in the replacement
  • ( Capture group 2
    • \h+ Match 1+ horizontal whitespace chars
    • (?:.*?[^\s=]+=[^\s=]+){0} Repeat n times: as few chars as possible followed by a key=value pair (n is the number of pairs to skip)
    • .*? Match any char, as few times as possible
  • ) Close group 2
  • [^\s=]+= Match 1+ chars other than a whitespace char or = , then match = (the key part; any key matches here, not only name )
  • "?([^\s="]+)"? Capture 1+ chars other than a whitespace char, = or " in group 3, between optional double quotes
  • \h* Match optional trailing horizontal whitespace

See a regex demo for 0 repetitions, 1 repetition and 2 repetitions.


Running the pattern in the command with {0}, {1} and {2} in turn:

find . -type f -exec perl -pi -w -e 's/(title=)"?[^\s="]+"?(\h+(?:.*?[^\s=]+=[^\s=]+){0}.*?)[^\s=]+="?([^\s="]+)"?\h*/$1"$3"$2/' \{\} \;

changes the line in the file to, respectively:

title="beta" lorem ipsum lorem ipsum Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus Curabitur ullamcorper finibus consectetur name=sigma

title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur 

You may use this perl solution:

# 3rd occurrence 
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){3}"?([^"\s]+)"?)/$1"$3"$2/' 

title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

# 2nd occurrence
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){2}"?([^"\s]+)"?)/$1"$3"$2/'

title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

# 1st occurrence
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){1}"?([^"\s]+)"?)/$1"$3"$2/'

title="beta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

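Here (?:.*?\h+name=){N} matches N occurrences of a sub-pattern consisting of any text, followed by 1+ horizontal whitespace chars, followed by name= . Note that capture group 2 includes the matched name=value, so it stays in the line, as the outputs above show.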

This can be simplified by doing multiple passes instead of using one conquer-all regex:

$N = 1;                          # for the first match
$cnt = 0;                        # silence warnings ($cnt used once)
while (/name="?([^"\s]*)"?/g) {
    if (++$cnt == $N) {          # get to N-th match
        $n = $1;                 # store it
        s{name="?\Q$n\E"?}{};    # remove (\Q..\E escapes any regex metacharacters in the value)
        last
    }
}
s{title=("?\K[^"\s]*)"?}{$n"}    # rewrite title with name

A full example

perl -pwE'
    $N //= shift // 1;    # read N once; a plain "$N = shift // 1" falls back to 1 after the first line
    $cnt = 0;
    while (/name="?([^"\s]*)"?/g) {
        if (++$cnt == $N) { $n=$1; s{name="?\Q$n\E"?}{}; last }
    };
    s{title=("?\K[^"\s]*)"?}{$n"}
' file.txt 2

where for testing I use file.txt with the line from the question,

title="alpha" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma

The command-line input 2 makes it look for the second name= . It prints

title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus  Curabitur ullamcorper finibus consectetur name=sigma

The whole thing can be written on one line if needed for some reason.


This is inefficient in the sense that we search for the same pattern twice: once in the while condition and again in the substitution in the loop body. It is not as bad as it may look, since the second pattern is rather simple and could be optimized if it mattered, but it is still two regexes with the same pattern. Then we run yet another regex to rewrite the title.

The gain is (comparative) simplicity: each pattern looks for one simple, isolated phrase ( name or title ).

Instead of doing everything in one regex, proceed in steps:

perl -lwpe '$n = 2;
            @m=/name="?([^" ]+)"?/g;
            s/title="[^"]+"/title="$m[$n-1]"/;
            s/ name="?\Q$m[$n-1]\E"?//'

The steps are:

  1. Extract all the names into the @m array;
  2. Replace the title with the wanted name;
  3. Remove the name definition.

Note: It's not clear to me why you say sigma is the 2nd name. I'd say it's the 3rd one, with delta being the 2nd.
