簡體   English   中英

使用Perl Regex解析語法樹

[英]Parsing a Syntax Tree with Perl Regex

也許正則表達式不是解析此問題的最佳方法,請告訴我是否不是。 無論如何,這是語法樹的一些示例:

(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))   

無論如何,我想做的是拔出連接詞(然后是一次,依此類推)及其對應的頭(CC,IN,CC),我已經知道每個語法樹,因此它可以充當錨,並且我還需要檢索其父級(第一個是S,第二個是SBARTMP,第三個是S),以及是否有兄弟姐妹(在第一個兄弟姐妹中,在第二個左側兄弟姐妹中,還有第三個兄弟姐妹)左側和右側同級)。 比父母更高的東西都不包括在內

my $pos = "(\\\w|-)*";  
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};  
my $connective = "once";  
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};  

此代碼適用於以下情況:

my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";  
my $test2 = "(X (SBAR-TMP (IN once))";  
my $test3 = "(X (SBAR-TMP (IN once) (X as))";  
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";  

它將把X扔掉,並保留其他所有內容,但是,一旦兄弟姐妹中嵌入了某些東西,則它不匹配,因為正則表達式不會更深入。

my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";  

我不確定該如何解決。 我是Perl擴展模式的新手,剛剛開始學習它。 為了澄清一下正則表達式的功能:它在兩個括號內查找連接詞和大寫字母/-組合,在兩個括號內尋找具有相同格式的完整父級,然后應查找任意數量的同級他們所有的括號都配對了。

要僅獲取與錨定連接詞最接近的“父”,可以將其作為具有FAIL的遞歸父或直接進行。 (由於某種原因,我無法編輯其他帖子,必須刪除Cookie)。

use strict;
use warnings;

my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
  \s*
  ( 
     (?! $connective )
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

my $regex1 = qr/
      \( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
 /x;

my $regex2 = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my $sample = qq/
 (X (SBAR-TMP (IN once) (S sdf) (S sdf)))
 (X (SBAR-TMP (IN once))
 (X (SBAR-TMP (IN once) (X as))
 (X (SBAR-TMP (X adsf) (IN once))
 (X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
 (S (CC and))  
 (SBARTMP (IN once) (NP otherstuff))   
 (S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;

while ($sample =~ /$regex1/xg) {
    print "Found:   $1\n";
}
print '-' x 20, "\n";

while ($sample =~ /$regex2/xg) {
    print "Found:   $2\n";
}

__END__

你為什么要放棄這個,你幾乎做到了。 嘗試這個:

use strict;
use warnings;

 my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
 my $sibling = qr/
  \s*
  ( 
     (?!$connect)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

 my $regex = qr/
   ( #1
     \(
        \s* [\w-]+ \s*
        (?>   $sibling* \s* $connective \s* $sibling*
            | (?1)
        )
      \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
         print "Found:   $1\n";
    }
}

my $another =<<EOS;
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S
  (S
    (NP blah
      (VP blah)
    )
    (CC then)
    (NP blah
      (VP blah
        (PP blah)
      )
    )
  )
)
EOS

print "\n---------\n";
    while ($another =~ /$regex/xg) {
         print "\nFound:\n$1\n";
    }

結束

這也應該工作

use strict;
use warnings;

my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
  (?: \s*
  ( 
     (?!$connective)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s* )
 /x;

my $regex = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
  '(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',    
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
        print "Found:   $2\n";
    }
}

my $another = "
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";

print "\n---------\n";
while ($another =~ /$regex/xg) {
    print "\nFound:\n$2\n";
}

__END__

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM