
Perl regular expressions replacement

I haven't been able to figure out how to deal with a specific regex problem.

Say I have a big string that consists of lots of phrases in square brackets. Each phrase contains a phrase label (e.g. S or VP), a token (e.g. w or wSf), a slash next to that token, and then the token's description (e.g. CC or VBD_MS3).

So here's an example string:

[S w#/CC] [VP mSf/VBD_MS3]

I want to delete the whole first bracketed phrase and put the w inside of it with the second phrase, like this:

[VP wmSf/VBD_MS3]

Is that even possible using regular expressions?


Edit: Okay, the pattern is:

[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]

(the second bracketed phrase could have one to any number of <word>/<label> pairs)

where <label> can be any sequence of capital letters that might include an underscore, and <word> can be a sequence of anything that's not whitespace (i.e. digits/characters/special characters).

Yes,

s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;

Now what patterns are you looking for?

You could even do this:

s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;

Without knowing the actual form or positions, one of these forms might work (untested):

s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP) (\w+)#/\w+\] (\[(?:S|VP) )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP)\s+(\w+)#/\w+\]\s+(\[(?:S|VP)\s+)(\w+/\w+\])}{$2$1$3}g
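
A minimal sketch for trying the first of these forms (written here with single backslashes) against the sample from the question. The second input line and the variable names are made up for illustration; only the substitution itself comes from the answer.

```perl
use strict;
use warnings;

# Sample inputs: the first is from the question, the second is a
# hypothetical line with two token phrases, added only for illustration.
my @lines = (
    '[S w#/CC] [VP mSf/VBD_MS3]',
    '[S w#/CC] [VP mSf/VBD_MS3] [S b#/IN] [VP ktb/VBD]',
);

my @results;
for my $line (@lines) {
    # Work on a copy so the original line stays available for comparison.
    (my $copy = $line) =~ s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g;
    push @results, $copy;
    print "$copy\n";
}
```

Because of the /g modifier, every token phrase on a line is folded into the phrase that follows it. This form only handles a single word/label pair in the second phrase.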

Edit: Since your edit has included this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it makes it easier to come up with a regex that should work.

Good luck!

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

      \[\s*                         #= Start of token phrase '['
          (?&label) \s+                 # <label> then whitespace's
          ((?&word))                    # Capture $1 - token word, end grp $1
          [#]/(?&label)                   # '#'/<label>
          \s*
      \]                            #= End of token phrase ']'
      \s*
    (                             # Capture grp $2
      \[\s*                         #= Start of normal phrase '['
          (?&label) \s+                 # <label> then whitespace's
    )                             # End grp $2
    (                             # Capture grp $3
          (?&word)/(?&label)            # First <word>/<label> pair
          (?:                     
             \s+(?&word)/(?&label)      # Optional, many <word>/<label> pair's
          )*                      
          \s*
      \]                            #= End of normal phrase ']'
    )                             # End grp $3

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/$2$1$3/g;

print $data;

__DATA__

[S w#/CC] [VP mSf/VBD_MS3]

Output:
[VP wmSf/VBD_MS3]

Edit 2
"if the label of the character is PP, and if the next phrase's label is NP, then change the next phrase's label to PP as well when joining. e.g. input: [PP w#/IN] [NP something/NN] output: [PP wsomething/NN]"

Sure, without adding too many new capture groups, it can be done with a callback.
Actually, there are many ways to do this, including regex conditionals. I think the simplest method is a callback, where the logic for all label decisions can live in one place.

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

   ( \[\s*                  # 1 - Token phrase label
         (?&label)         
         \s+
   )
         (                  # 2 - Token word
            (?&word)
         )         
         [#]/(?&label)
         \s*
     \]
     \s*

   ( \[\s*                  # 3 - Normal phrase label
         (?&label)
         \s+
   )
      # insert token word ($2) here
   (                        # 4 - The rest ..
         (?&word)/(?&label)
         (?: \s+ (?&word)/(?&label) )*                      
         \s*
      \]
   )

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;


sub checkLabel
{
   my ($p1, $p2) = @_;
   if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
      return $1.'PP'.$2;
      # To use the formatting of the token label, just 'return $p1;'
   }
   return $p2;
}


print $data;

__DATA__

[PP w#/CC] [ NP     mSf/VBD_MS3]

Rather than create a magic regex to do the whole job, why not separate the line into phrases, operate on them, then return them? This then follows the same logic that you just explained.

This is then cleaner, more readable (especially if you add comments), and more robust. Of course you will need to tailor it to your needs: for example, you may want to make the / separated portions into key/value pairs (does the order matter? if not, make a hashref); perhaps you don't need to split on / if you never need to modify the label; etc.
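
As a rough sketch of the key/value idea, assuming each piece after the label is a single word/label pair, a phrase could be parsed into a small hash. The helper name parse_phrase and the hash keys are hypothetical, and an array of pairs is used rather than a flat hash so that the order of the pairs is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the inside of one bracketed phrase into
# { label => ..., pairs => [ [word, tag], ... ] }.
sub parse_phrase {
    my ($text) = @_;
    my ($label, @pieces) = split ' ', $text;
    # Split each piece on its last '/', so a word may itself contain '/'.
    my @pairs = map { [ m{^(.*)/([^/]*)$} ] } @pieces;
    return { label => $label, pairs => \@pairs };
}

my $line    = '[S w#/CC] [VP mSf/VBD_MS3]';
my @phrases = map { parse_phrase($_) } ($line =~ /\[([^\]]+)\]/g);

for my $p (@phrases) {
    print $p->{label}, ': ',
        join(', ', map { "$_->[0] => $_->[1]" } @{ $p->{pairs} }), "\n";
}
```

If the order of the pairs never matters, the inner array refs could be replaced by a plain hash keyed on the word.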

Edit per comments: This takes a literal w before a #, stores it, removes the phrase, then tacks the w onto the next phrase. If that's what you need, then have at it. Of course I'm sure there are edge cases to look out for, so back up and test first!

#!/usr/bin/env perl

use strict;
use warnings;

while( my $line = <DATA> ) {
  #separate phrases, then split phrases into whitespace-separated pieces
  my @phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);

  my $holder; # holder for 'w' (not really needed if always 'w')
  foreach my $p (@phrases) { # for each phrase
    if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
      $holder = $1; # keep the 'w' in holder
      $p = undef; #empty to mark for cleaning later
      next; #move to next phrase
    }

    if ($holder) { #if the holder is not empty
      $p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
      $holder = undef; # and then empty the holder
    }
  }

  #remove emptied phrases
  @phrases = grep { $_ } @phrases;

  #reconstitute the line
  print join( ' ', map { '[' . join(' ', @$_) . ']' } @phrases), "\n";
}

__DATA__
[S w#/CC] [VP mSf/VBD_MS3]

Again, it may seem amazing what you can do with one regex, but what happens if your boss comes in and says, "you know, that thing you wrote to do X works great, but now it needs to do Y too"? This is why I like to keep nicely separate logic for each logical step.
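
To illustrate that point, here is a hedged sketch in which the PP/NP relabelling requested in the question's follow-up is added as one extra line inside its own step. The sub names split_phrases, merge_tokens and join_phrases are invented for this example, and it assumes a token phrase always has exactly two pieces (a label and w#/label).

```perl
use strict;
use warnings;

# Step 1: split a line into phrases, each an array ref of
# whitespace-separated pieces.
sub split_phrases {
    my ($line) = @_;
    return map { [ split ' ', $_ ] } ($line =~ /\[([^\]]+)\]/g);
}

# Step 2: fold each 'w#' token phrase into the phrase that follows it.
# A token phrase with nothing after it is dropped.
sub merge_tokens {
    my @phrases = @_;
    my @out;
    my $pending;    # [label, token] of a token phrase waiting to be merged
    for my $p (@phrases) {
        if (@$p == 2 && $p->[1] =~ m{^(w)#/}) {
            $pending = [ $p->[0], $1 ];
            next;
        }
        if ($pending) {
            my ($label, $token) = @$pending;
            # The extra rule ("Y"): a PP token phrase turns a following NP into PP.
            $p->[0] = 'PP' if $label eq 'PP' && $p->[0] eq 'NP';
            $p->[1] = $token . $p->[1];
            $pending = undef;
        }
        push @out, $p;
    }
    return @out;
}

# Step 3: turn the phrases back into a line.
sub join_phrases {
    return join ' ', map { '[' . join(' ', @$_) . ']' } @_;
}

my $result = join_phrases( merge_tokens( split_phrases('[PP w#/IN] [NP something/NN]') ) );
print "$result\n";
```

The new requirement touches only merge_tokens; splitting and re-joining stay as they were.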

#!/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;
