
Perl regex capture groups and reshuffle pattern

I am using Perl regex capture groups to rewrite a pattern across a large number of files.

File example 1:

title="alpha" lorem ipsum lorem ipsum name="beta"

File example 2:

title="omega" Morbi posuere metus purus name="delta"

I want these lines to become:

title="beta" lorem ipsum lorem ipsum
title="delta" Morbi posuere metus purus

using this command:

find . -type f -exec perl -pi -w -e 's/title="(?'one'.*?)"(?'three'.*?)name="(?'two'.*?)"/title="\g{two}"\g{three}/g;' \{\} \;

(Note that (1) the attribute values of title and name are not known in advance and (2) the content between title="alpha" and name="beta" differs from file to file.)

I am still learning Perl regex. What am I doing wrong?

This perl command line should work:

perl -pe 's/(title=)"?[^"\s]*"?(.*) name="?([^"\s]+)"?/$1"$3"$2/' file

title="beta" lorem ipsum lorem ipsum
title="delta" Morbi posuere metus purus

Explanation:

  • (title=) : Match title= and capture in group #1
  • "?[^"\s]*"? : Match the old title value (no quotes or whitespace inside), with optional surrounding quotes; not captured
  • (.*) : Match 0 or more of any chars and capture in group #2
  •  name="? : Match a space and the name= text, followed by an optional "
  • ([^"\s]+) : Match 1 or more chars other than " and whitespace and capture in group #3
  • "? : Optional "
  • $1"$3"$2 : Replacement part


A bit of syntax: capture with (?<name>pattern) and then, outside of the pattern, refer to the capture as $+{name} (other delimiters for the group name are allowed, see perlre). The whole substitution:

s{ title="(?<t>[^"]+)" (?<text>.*?) name="(?<n>[^"]+)" }
 {title="$+{n}"$+{text}}x

The \g{name} syntax attempted in the question is a backreference: it is used inside the pattern itself, when the captured text has to be matched again further along in the same pattern. After the matching side, so in the replacement part or after the regex, the captures are retrieved from the %+ variable.

The [^"] is a negated character class, matching any character other than ". The /x modifier at the end makes the regex ignore literal whitespace inside the pattern, so we can use spaces for readability.

A full example, with the above regex, to run on the command line

echo title=\"alpha\" lorem ipsum lorem ipsum name=\"beta\" | perl -wpe \
's{title="(?<t>[^"]+)"(?<text>.*?)name="(?<n>[^"]+)"}{title="$+{n}"$+{text}}'

(broken into two lines for readability). It prints

title="beta" lorem ipsum lorem ipsum 

I am not sure why the first value needs to be captured, as it is in the question, but perhaps there is more to the problem than shown, so it is captured here as well, into $+{t}.

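Also, the question uses quotes rather loosely: the (?'one'...) form puts single quotes inside a program that is itself delimited by single quotes on the command line, so the shell ends the string at the first inner quote. One can string together '-delimited strings for one command-line program, but I'd suggest not to (if that was the intent).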

1st solution: Since you are already using the shell's find command, in case you are OK with awk, here is a solution written and tested in GNU awk.


awk -v s1="\"" '
match($0,/(title=)"[^"]*" (.*)name="([^"]*)"/,arr){
  print arr[1] s1 arr[3] s1,arr[2]
}
'  Input_file

Explanation: This uses GNU awk's match function, which takes a regex and, as an optional third argument, an array to fill with the captured groups. The regex (title=)"[^"]*" (.*)name="([^"]*)" creates 3 capturing groups, whose values are stored in the array arr at indexes 1, 2 and 3. The print statement then emits them in the order required by the OP.



2nd solution: With sed, using the same regex and the -E (ERE) option, please try the following code.

sed -E 's/^(title=)"[^"]*" (.*)name="([^"]*)"/\1"\3" \2/' Input_file
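
One detail worth noting: with this pattern the space before name= ends up in \2, so the output line ends with a trailing space. If that matters, one possible tweak (an assumption, not part of the original answer) is to match that space outside the group, as sketched below on the first sample line.

```shell
line='title="alpha" lorem ipsum lorem ipsum name="beta"'

# Original command: \2 captures "lorem ipsum lorem ipsum " including the
# space before name=, so the result ends with a space.
original=$(printf '%s\n' "$line" |
  sed -E 's/^(title=)"[^"]*" (.*)name="([^"]*)"/\1"\3" \2/')

# Variant: the space before name= is matched outside the group.
trimmed=$(printf '%s\n' "$line" |
  sed -E 's/^(title=)"[^"]*" (.*) name="([^"]*)"/\1"\3" \2/')

# Brackets make the trailing space visible.
printf '[%s]\n[%s]\n' "$original" "$trimmed"
```

The variant only matches when a space actually precedes name=, which holds for both samples in the question.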


 