I am learning Perl regex and am trying to combine capture groups with specifying the nth occurrence of a string.
Say I have the following:
title="alpha" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
I want to change the title attribute to the string that follows the nth name=, e.g. sigma, while keeping all the content in between. Also, the value after name= may or may not be in double quotes, as in name="beta" or name=sigma.
1st occurrence of name=:
title="beta" lorem ipsum lorem ipsum Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
2nd occurrence of name=:
title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur
I use:
find . -type f -exec perl -pi -w -e 's/(title=)"?[^"\s]*"?(.*) name="?([^"\/]+)"?/$1"$3"$2/' \{\} \;
This works for the first occurrence of name=.
I cannot figure out how to modify this to specify the nth occurrence of name=. I know the basics of specifying the nth occurrence (such as replacing the second abc by xyz), ...
s/abc/ ++$count == 2 ? "xyz" : "abc" /eg
... but I have trouble integrating this into my code above. How do I specify the nth name= and move the capture group that follows it into the place of the title attribute?
You can use a pattern with a manually set quantifier in the {n} part, which optionally repeats key=value pairs to get to the one you are interested in.
(title=)"?[^\s="]+"?(\h+(?:.*?[^\s=]+=[^\s=]+){0}.*?)[^\s=]+="?([^\s="]+)"?\h*
                                              ^^^
The pattern matches:
(title=)"?[^\s="]+"? : capture title= in group 1, then match the old value, which is not kept after the replacement
( : open capture group 2
    \h+ : match 1+ horizontal whitespace characters
    (?:.*?[^\s=]+=[^\s=]+){0} : repeat a preceding key=value pair n times
    .*? : match any character, as few as possible
) : close group 2
[^\s=]+= : match 1+ characters other than whitespace or =, then match the = (the key part)
"?([^\s="]+)"? : capture 1+ characters other than whitespace, = or " in group 3, between optional double quotes
\h* : match optional trailing spaces
See the regex demos for 0 repetitions, 1 repetition and 2 repetitions.
Running the pattern in the command with {0}, {1} and {2} in turn (the command below shows {0}):
find . -type f -exec perl -pi -w -e 's/(title=)"?[^\s="]+"?(\h+(?:.*?[^\s=]+=[^\s=]+){0}.*?)[^\s=]+="?([^\s="]+)"?\h*/$1"$3"$2/' \{\} \;
This changes the line in the file to, respectively:
title="beta" lorem ipsum lorem ipsum Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus Curabitur ullamcorper finibus consectetur name=sigma
title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur
You may use this Perl solution:
# 3rd occurrence
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){3}"?([^"\s]+)"?)/$1"$3"$2/'
title="sigma" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
# 2nd occurrence
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){2}"?([^"\s]+)"?)/$1"$3"$2/'
title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
# 1st occurrence
perl -pe 's/(title=)"?[^"\s]*"?((?:.*?\h+name=){1}"?([^"\s]+)"?)/$1"$3"$2/'
title="beta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
Here (?:.*?\h+name=){N} matches N occurrences of a sub-pattern: any text, followed by 1+ horizontal whitespace characters, followed by name=.
This can be simplified by doing multiple passes instead of one conquer-all regex:
$N = 1;    # for the first match
$cnt = 0;  # silence warnings ($cnt used once)
while (/name="?([^"\s]*)"?/g) {
    if (++$cnt == $N) {      # get to N-th match
        $n = $1;             # store it
        s{name="?$n"?}{};    # remove
        last
    }
};
s{title=("?\K[^"\s]*)"?}{$n"}  # rewrite title with name
A full example:
perl -pwE'
    $N //= shift//1;  $cnt = 0;   # read N once (default 1); reset the counter per line
    while (/name="?([^"\s]*)"?/g) {
        if (++$cnt == $N) { $n=$1; s{name="?$n"?}{}; last }
    };
    s{title=("?\K[^"\s]*)"?}{$n"}
' file.txt 2
where for testing I use file.txt with the line from the question,
title="alpha" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus name=delta Curabitur ullamcorper finibus consectetur name=sigma
The command-line argument 2 makes it seek the second name=. It prints
title="delta" lorem ipsum lorem ipsum name="beta" Morbi posuere metus purus Curabitur ullamcorper finibus consectetur name=sigma
The whole thing can be written on one line if needed for some reason.
This is inefficient in the sense that we search for a pattern (in the while condition) and then search again for the substitution (in the body). It is not as bad as it may look, since the second pattern is rather straightforward and could be optimized if it mattered, but it is still two regexes with the same pattern. And then we run yet another to rewrite the title.
The gain is the (comparative) simplicity, where all patterns seek an isolated simple phrase (with name and title).
Instead of doing everything in one regex, proceed in steps:
perl -lwpe '$n = 2;
@m=/name="?([^" ]+)"?/g;
s/title="[^"]+"/title="$m[$n-1]"/;
s/ name="?\Q$m[$n-1]\E"?//'
Note: It's not clear to me why you say sigma is the 2nd name. I'd say it's the 3rd one, with delta being the 2nd one.