Perl regex substitution

Question

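I still can't figure out how to handle a particular regex problem.

Say I have a large string consisting of many phrases in square brackets. Each phrase has a phrase label (e.g. S or VP), a token (e.g. w or wSf), a slash right after that token, and then the token's description (e.g. CC or VBD_MS3).

So here is a sample string: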

[S w#/CC] [VP mSf/VBD_MS3]

I want to delete the whole first bracketed phrase and attach its w to the second phrase, like so:

[VP wmSf/VBD_MS3]

Is this even possible with regular expressions?

Edit: OK, the pattern is:

[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]

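(the second bracketed phrase can contain anywhere from one to any number of <word>/<label> pairs)

where <label> can be any sequence of capital letters, possibly including underscores, and <word> can be any sequence of non-whitespace characters (i.e. digits, letters, special characters).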

Answer 1

Yes,

s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;

Now, what patterns will you be looking for?

You could even do this:

s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;

Answer 2

Without knowing the actual form or positions, one of these might work (untested):

s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP) (\w+)#/\w+\] (\[(?:S|VP) )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP)\s+(\w+)#/\w+\]\s+(\[(?:S|VP)\s+)(\w+/\w+\])}{$2$1$3}g
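
As a quick check of the whitespace-tolerant form, here is a minimal sketch that wraps it in a sub and runs it on the sample string from the question. The sub name `merge_simple` is made up for illustration, and the sketch assumes `|` is the intended alternation between the S and VP labels. Like the patterns above, it only handles a single word/label pair in the second phrase and only the labels S and VP.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the whitespace-tolerant substitution.
#   $1 = token word before '#'
#   $2 = opening of the second phrase, e.g. '[VP '
#   $3 = the single word/label pair plus the closing ']'
sub merge_simple {
    my ($str) = @_;
    $str =~ s{\[(?:S|VP)\s+(\w+)#/\w+\]\s+(\[(?:S|VP)\s+)(\w+/\w+\])}{$2$1$3}g;
    return $str;
}

print merge_simple('[S w#/CC] [VP mSf/VBD_MS3]'), "\n";   # prints [VP wmSf/VBD_MS3]
```

Input whose labels are not S or VP, such as `[PP w#/IN] [NP something/NN]`, is returned unchanged by this sketch.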

Edit: since your edit now includes this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it is easier to come up with a regex that should work.

Good luck!

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

      \[\s*                         #= Start of token phrase '['
          (?&label) \s+                 # <label> then whitespace's
          ((?&word))                    # Capture $1 - token word, end grp $1
          [#]/(?&label)                   # '#'/<label>
          \s*
      \]                            #= End of token phrase ']'
      \s*
    (                             # Capture grp $2
      \[\s*                         #= Start of normal phrase '['
          (?&label) \s+                 # <label> then whitespace's
    )                             # End grp $2
    (                             # Capture grp $3
          (?&word)/(?&label)            # First <word>/<label> pair
          (?:                     
             \s+(?&word)/(?&label)      # Optional, many <word>/<label> pair's
          )*                      
          \s*
      \]                            #= End of normal phrase ']'
    )                             # End grp $3

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/$2$1$3/g;

print $data;

__DATA__

[S w#/CC] [VP mSf/VBD_MS3]

Output:
[VP wmSf/VBD_MS3]

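Edit 2
"If the token's label is PP and the next phrase's label is NP, then also change the next phrase's label to PP when joining. For example, input: [PP w#/IN] [NP something/NN], output: [PP wsomething/NN]"

Sure. Without adding too many new capture groups, this can be done with a callback.
There are actually many ways to do it, including regex conditionals. I think
the simplest is a callback in which all of the label-decision logic lives.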

use strict;
use warnings;


$/ = undef;

my $data = <DATA>;


my $regex = qr{

   ( \[\s*                  # 1 - Token phrase label
         (?&label)         
         \s+
   )
         (                  # 2 - Token word
            (?&word)
         )         
         [#]/(?&label)
         \s*
     \]
     \s*

   ( \[\s*                  # 3 - Normal phrase label
         (?&label)
         \s+
   )
      # insert token word ($2) here
   (                        # 4 - The rest ..
         (?&word)/(?&label)
         (?: \s+ (?&word)/(?&label) )*                      
         \s*
      \]
   )

   (?(DEFINE)               ## DEFINE's:
     (?<label> \w+)             # <label> - 1 or more word characters
     (?<word>  [^\s\[\]]+ )     # <word>  - 1 or more NOT whitespace, '[' nor ']'
   )
}x;


$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;


sub checkLabel
{
   my ($p1, $p2) = @_;
   if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
      return $1.'PP'.$2;
      # To use the formatting of the token label, just 'return $p1;'
   }
   return $p2;
}


print $data;

__DATA__

[PP w#/CC] [ NP     mSf/VBD_MS3]
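
If more label rules show up later, the callback idea can also be driven by a lookup table. The following is only a sketch under that assumption: the `%relabel` table, the sub name `join_phrases`, and the named captures are all made up for illustration and are not part of the code above. Unlike `checkLabel`, it does not preserve the original spacing inside the brackets; it writes a normalized `[LABEL tokens]` form.

```perl
use strict;
use warnings;

# Hypothetical rule table: "token-phrase label,next-phrase label" => new label.
my %relabel = ( 'PP,NP' => 'PP' );

my $label = qr/\w+/;            # <label>: one or more word characters
my $word  = qr/[^\s\[\]]+/;     # <word>: no whitespace, '[' or ']'

sub join_phrases {
    my ($text) = @_;
    $text =~ s{
        \[ \s* (?<tlabel> $label ) \s+ (?<tword> $word ) \# / $label \s* \]   # token phrase
        \s*
        \[ \s* (?<nlabel> $label ) \s+                                        # next phrase label
        (?<rest> $word / $label (?: \s+ $word / $label )* ) \s* \]            # its word/label pairs
    }{
        # Look up a replacement label; fall back to the phrase's own label.
        my $new = $relabel{"$+{tlabel},$+{nlabel}"} // $+{nlabel};
        "[$new $+{tword}$+{rest}]";
    }gex;
    return $text;
}

print join_phrases('[PP w#/IN] [NP something/NN]'), "\n";   # prints [PP wsomething/NN]
print join_phrases('[S w#/CC] [VP mSf/VBD_MS3]'), "\n";     # prints [VP wmSf/VBD_MS3]
```

Adding another rule is then a one-line change to the table rather than a change to the regex.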

Answer 3

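Rather than creating one magic regex to do the whole job, why not split the line into phrases, operate on them, and then put them back together? That follows the same logic you just explained.

This is cleaner, more readable (especially once you add comments), and more robust. Of course you will need to adapt it to your needs: for example, you may want to split the /-separated parts into key/value pairs (does order matter? if not, use a hashref); if you never need to modify the labels, you may not need to split on / at all; and so on.

Edit per comment: this takes the literal w before the #, stores it, removes the phrase, and then prepends the w to the first token of the next phrase. If that is what you need, have at it. I'm sure there are edge cases to watch for, so back up and test first!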

#!/usr/bin/env perl

use strict;
use warnings;

while( my $line = <DATA> ) {
  #separate phrases, then split phrases into whitespace separated pieces
  my @phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);

  my $holder; # holder for 'w' (not really needed if always 'w')
  foreach my $p (@phrases) { # for each phrase
    if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
      $holder = $1; # keep the 'w' in holder
      $p = undef; #empty to mark for cleaning later
      next; #move to next phrase
    }

    if ($holder) { #if the holder is not empty
      $p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
      $holder = undef; # and then empty the holder
    }
  }

  #remove emptied phrases
  @phrases = grep { $_ } @phrases;

  #reconstitute the line
  print join( ' ', map { '[' . join(' ', @$_) . ']' } @phrases), "\n";
}

__DATA__
[S w#/CC] [VP mSf/VBD_MS3]

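Again, what you can do with a single regex may seem amazing, but what happens when your boss comes in and says, "You know, that thing you wrote to do X works great, but now it needs to do Y too"? This is why I like to keep the logic for each step nicely separate.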
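
To make the "separate steps" idea concrete, here is a rough sketch that also splits each token into a word/label pair, as suggested earlier in this answer. The sub names `parse_phrases`, `merge_marked`, and `format_phrases` are invented for illustration. It assumes a marker phrase is one that holds exactly one pair whose word ends in `#`, and that a token is split on its last slash.

```perl
use strict;
use warnings;

# Step 1: parse "[LABEL word/TAG ...]" phrases into hashes.
sub parse_phrases {
    my ($line) = @_;
    my @phrases;
    for my $body ( $line =~ /\[([^\]]+)\]/g ) {
        my ( $label, @tokens ) = split ' ', $body;
        # Split on the LAST slash; a token without a slash keeps an empty tag.
        my @pairs = map { m{^(.+)/([^/]+)$} ? [ $1, $2 ] : [ $_, '' ] } @tokens;
        push @phrases, { label => $label, pairs => \@pairs };
    }
    return @phrases;
}

# Step 2: drop marker phrases and prepend their word to the next phrase.
sub merge_marked {
    my @phrases = @_;
    my ( @out, $prefix );
    for my $p (@phrases) {
        my $first = $p->{pairs}[0];
        if ( @{ $p->{pairs} } == 1 && $first->[0] =~ /^(.+)#$/ ) {
            $prefix = $1;    # remember the word, skip the phrase itself
            next;
        }
        if ( defined $prefix && $first ) {
            $first->[0] = $prefix . $first->[0];
            undef $prefix;
        }
        push @out, $p;
    }
    return @out;
}

# Step 3: turn the structures back into a bracketed string.
sub format_phrases {
    my @out;
    for my $p (@_) {
        my @tokens = map { length $_->[1] ? "$_->[0]/$_->[1]" : $_->[0] } @{ $p->{pairs} };
        push @out, '[' . join( ' ', $p->{label}, @tokens ) . ']';
    }
    return join ' ', @out;
}

print format_phrases( merge_marked( parse_phrases('[S w#/CC] [VP mSf/VBD_MS3]') ) ), "\n";
# prints [VP wmSf/VBD_MS3]
```

A new requirement, such as relabeling the merged phrase, would then go into `merge_marked` (or a fourth step) without touching the parsing or formatting code. Note that a marker phrase at the very end of a line is silently dropped in this sketch.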

Answer 4

#!/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;

Answers (4)

Answer 1: 1, 2011-11-15 18:03:50
Answer 2: 1, accepted
Answer 3: 1, 2011-11-15 19:38:32
Answer 4: 0, 2011-11-15 18:07:55
