简体   繁体   中英

perl regular expressions replacement

I haven't been able to figure out how to deal with a specific regex problem.

Say I have the a big string that consists of lots of phrases in square brackets. A phrase label (eg S or VP ), a token (eg w or wSf ), a slash next to that token and then the token's description, (eg CC or VBD_MS3 ).

So here's an example string:

[S w#/CC] [VP mSf/VBD_MS3]

I want to delete the whole first bracketed phrase and put the w inside of it with the second phrase, like this:

[VP wmSf/VBD_MS3]

Is that even possible using regular expressions?


Edit: Okay the pattern is:

[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]

(the second bracketed phrase could have one to any number of / pairs)

where can be any sequence of capital letters that might include an underscore, and word can a sequence of anything that's not whitespace (ie digits/characters/special characters).

Yes,

s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;

Now what patterns are you looking for?

You could even do this:

s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;

Without knowing the actual form or positions, one of these forms might work (untested):

s{\\[S (\\w+)#/\\w+\\] (\\[VP )(\\w+/\\w+\\])}{$2$1$3}g
or
s{\\[(?:S/VP) (\\w+)#/\\w+\\] (\\[(?:S/VP) )(\\w+/\\w+\\])}{$2$1$3}g
or
s{\\[(?:S/VP)\\s+(\\w+)#/\\w+\\]\\s+(\\[(?:S/VP)\\s+)(\\w+/\\w+\\])}{$2$1$3}g

Edit Since your edit has included this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it makes it easier to come up with a regex that should work.

Good luck!

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

      \[\s*                         #= Start of token phrase '['
          (?&label) \s+                 # <label> then whitespace's
          ((?&word))                    # Capture $1 - token word, end grp $1
          [#]/(?&label)                   # '#'/<label>
          \s*
      \]                            #= End of token phrase ']'
      \s*
    (                             # Capture grp $2
      \[\s*                         #= Start of normal phrase '['
          (?&label) \s+                 # <label> then whitespace's
    )                             # End grp $2
    (                             # Capture grp $3
          (?&word)/(?&label)            # First <word>/<label> pair
          (?:                     
             \s+(?&word)/(?&label)      # Optional, many <word>/<label> pair's
          )*                      
          \s*
      \]                            #= End of normal phrase ']'
    )                             # End grp $3

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/$2$1$3/g;

print $data;

__DATA__

[S w#/CC] [VP mSf/VBD_MS3]

Output:
[VP wmSf/VBD_MS3]

Edit2
"if the label of the character is PP, and if the next phrase's label is NP, then change the next phrase's label to PP as well when joining. eg. input: [PP w#/IN] [NP something/NN] output: [PP wsomething/NN]"

Sure, without adding too many new capture groups, it can be done with a callback.
Actually, there are many ways to do this, including regex conditionals. I think the
simplest method is with a callback, where the logic for all label decisions can be made.

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

   ( \[\s*                  # 1 - Token phrase label
         (?&label)         
         \s+
   )
         (                  # 2 - Token word
            (?&word)
         )         
         [#]/(?&label)
         \s*
     \]
     \s*

   ( \[\s*                  # 3 - Normal phrase label
         (?&label)
         \s+
   )
      # insert token word ($2) here
   (                        # 4 - The rest ..
         (?&word)/(?&label)
         (?: \s+ (?&word)/(?&label) )*                      
         \s*
      \]
   )

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;


sub checkLabel
{
   my ($p1, $p2) = @_;
   if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
      return $1.'PP'.$2;
      # To use the formatting of the token label, just 'return $p1;'
   }
   return $p2;
}


print $data;

__DATA__

[PP w#/CC] [ NP     mSf/VBD_MS3]

Rather than create a magic regex to do the whole job, why not separate the line into phrases, operate on them then return them. This then follows the same logic that you just explained.

This then cleaner, more readable (especially if you add comments) and robust. Of course you will need to tailor to your needs: for example you may want to make the / separated portions into key/value pairs (does the order matter? if not make a hashref); perhaps you don't need to split on / if you never need to modify the label; etc.

Edit per comments: This takes a literal w before a # , stores it, removes the phrase, then tacks the w onto the next phrase. If thats what you need then have at it. Of course I'm sure there are edge cases to look out for, so backup and test first!

#!/usr/bin/env perl

use strict;
use warnings;

while( my $line = <DATA> ) {
  #separate phrases, then split phases into whitespace separated pieces
  my @phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);

  my $holder; # holder for 'w' (not really needed if always 'w')
  foreach my $p (@phrases) { # for each phrase
    if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
      $holder = $1; # keep the 'w' in holder
      $p = undef; #empty to mark for cleaning later
      next; #move to next phrase
    }

    if ($holder) { #if the holder is not empty
      $p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
      $holder = undef; # and then empty the holder
    }
  }

  #remove emptied phrases
  @phrases = grep { $_ } @phrases;

  #reconstitute the line
  print join( ' ', map { '[' . join(' ', @$_) . ']' } @phrases), "\n";
}

__DATA__
[S w#/CC] [VP mSf/VBD_MS3]

Again, it may seem amazing what you can do with one regex, but what happens if your boss comes in and says, "you know, that thing you wrote to do X works great, but now it needs to do Y too". This is why I like to keep nicely separate logic for each logical step.

#/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM