[英]Parsing a Syntax Tree with Perl Regex
也許正則表達式不是解析此問題的最佳方法,請告訴我是否不是。 無論如何,這是語法樹的一些示例:
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
無論如何,我想做的是拔出連接詞(然后是一次,依此類推)及其對應的頭(CC,IN,CC),我已經知道每個語法樹,因此它可以充當錨,並且我還需要檢索其父級(第一個是S,第二個是SBARTMP,第三個是S),以及是否有兄弟姐妹(在第一個兄弟姐妹中,在第二個左側兄弟姐妹中,還有第三個兄弟姐妹)左側和右側同級)。 比父母更高的東西都不包括在內
my $pos = "(\\\w|-)*";
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};
此代碼適用於以下情況:
my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";
my $test2 = "(X (SBAR-TMP (IN once))";
my $test3 = "(X (SBAR-TMP (IN once) (X as))";
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";
它將把X扔掉,並保留其他所有內容,但是,一旦兄弟姐妹中嵌入了某些東西,則它不匹配,因為正則表達式不會更深入。
my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
我不確定該如何解決。 我是Perl擴展模式的新手,剛剛開始學習它。 為了澄清一下正則表達式的功能:它在兩個括號內查找連接詞和大寫字母/-組合,在兩個括號內尋找具有相同格式的完整父級,然后應查找任意數量的同級他們所有的括號都配對了。
要僅獲取與錨定連接詞最接近的“父”,可以將其作為具有FAIL的遞歸父或直接進行。 (由於某種原因,我無法編輯其他帖子,必須刪除Cookie)。
use strict;
use warnings;
my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
\s*
(
(?! $connective )
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex1 = qr/
\( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
/x;
my $regex2 = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my $sample = qq/
(X (SBAR-TMP (IN once) (S sdf) (S sdf)))
(X (SBAR-TMP (IN once))
(X (SBAR-TMP (IN once) (X as))
(X (SBAR-TMP (X adsf) (IN once))
(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;
while ($sample =~ /$regex1/xg) {
print "Found: $1\n";
}
print '-' x 20, "\n";
while ($sample =~ /$regex2/xg) {
print "Found: $2\n";
}
__END__
你為什么要放棄這個,你幾乎做到了。 嘗試這個:
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
\s*
(
(?!$connect)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex = qr/
( #1
\(
\s* [\w-]+ \s*
(?> $sibling* \s* $connective \s* $sibling*
| (?1)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $1\n";
}
}
my $another =<<EOS;
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S
(S
(NP blah
(VP blah)
)
(CC then)
(NP blah
(VP blah
(PP blah)
)
)
)
)
EOS
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$1\n";
}
結束
這也應該工作
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
(?: \s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s* )
/x;
my $regex = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
'(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $2\n";
}
}
my $another = "
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$2\n";
}
__END__
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.