How can I modify a complex XML document in Perl to add extra markup to text nodes?

Question

I have an XML document like this:

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>

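I need to parse it in Perl and then add new tags around some words or phrases (for example, to link them to definitions). I want to mark only the first instance of each target word, and to limit the search to the content of a given tag (for example, only the description tag).

I can parse it with XML::Twig and set a "twig_handler" for the description tag. But when I call $node->text, I get the text with the embedded tags stripped out. What I really want is to walk the (very small) tree so that the existing tags are preserved and not broken. The final XML output should therefore look like this: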

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b><a href="dictionary.html#frobnitz">frobnitz</a></b>, <a href="dictionary.html#crulps">crulps</a> and <a href="dictionary.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>

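I also have XML::LibXML available in the target environment, but I am not sure how to get started with it...

Here is my minimal test case so far. Any help is appreciated!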

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my %dictionary = (
    frobnitz    => 'dictionary.html#frobnitz',
    crulps      => 'dictionary.html#crulps',
    furtykurty  => 'dictionary.html#furtykurty',
    );

sub markup_plain_text { 
    my ( $text ) = @_;

    foreach my $k ( keys %dictionary ) {
        $text =~ s/(^|\W)($k)(\W|$)/$1<a href="$dictionary{$k}">$2<\/a>$3/si;
    }

    return $text;
}

sub convert {
    my( $t, $node ) = @_;
    warn "convert: TEXT=[" . $node->text . "]\n";
    $node->set_text( markup_plain_text($node->text) );
    return 1;
}

sub markup {
    my ( $text ) = @_;

    my $t = XML::Twig->new(
        twig_handlers => { description => \&convert },
        pretty_print  => 'indented',
        );
    $t->parse( $text );

    return $t->flush;
}


my $orig = <<END_XML;
<article>
    <author>Smith</author>
    <date>2011-10-10</date>
    <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz's</description>
</article>
END_XML
;

markup($orig);

Answer 1

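This is a slightly tricky problem, but XML::Twig is designed for this kind of processing (and I use it a lot). It has a specific method for this, called mark, which takes a regular expression and tags the matches.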
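To make the behaviour of mark concrete before the full solution, here is a minimal sketch. It assumes a hypothetical one-element document and a hand-written two-word regexp, neither of which comes from the original post. It only shows that text inside child elements such as <b> is handled without losing the existing tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical one-element document, used only for illustration.
my $xml = '<description>Article about <b>frobnitz</b> and crulps</description>';

my $twig = XML::Twig->new;
$twig->parse($xml);

# mark() splits the text descendants of the element in place. Because the
# regexp has capturing parentheses, each match is wrapped in a new <a>
# element. Only the newly created elements are returned.
my @links = $twig->root->mark( qr/\b(frobnitz|crulps)\b/, 'a' );

# sprint() serialises the element, including its tags.
my $marked = $twig->root->sprint;

print scalar(@links), " matches wrapped\n";
print "$marked\n";
```

The new <a> elements have no attributes at this point, so the href still has to be set on each returned element afterwards.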

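In this case the regular expression can get quite big. I used Regexp::Assemble to build it, so that it is optimized. The other problem is that mark does not let you use the matched text to set attributes (I may add that in the next version of the module, as it would be useful), so I have to mark first, then go back and set the href attribute in a second pass. A second pass is needed in any case to "unlink" words that have already been linked.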
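As a rough illustration of the regexp-building step in isolation, the sketch below assembles the three dictionary words into one word-anchored pattern and tries it on a plain string. The sample sentence and the variable names are assumptions chosen for the demonstration; only the Regexp::Assemble calls mirror the ones used in the solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Regexp::Assemble;

# The three words from the question's dictionary.
my @words = qw(frobnitz crulps furtikurty);

my $ra = Regexp::Assemble->new;
$ra->add(@words);      # merge the words into one optimized alternation
$ra->anchor_word;      # surround the pattern with \b word boundaries
my $pattern = $ra->as_string;

# mark() needs capturing parentheses, so wrap the assembled pattern.
my $match_defs = qr/($pattern)/;

# Illustrative sample text, not taken from the XML document itself.
my $sample = "Article about frobnitz, crulps and furtikurty's.";
my @found  = $sample =~ /$match_defs/g;

print "pattern: $pattern\n";
print "found:   @found\n";
```

With a dictionary of three words the gain is negligible; the point of assembling the pattern is that it should stay manageable as the word list grows.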

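One last remark: I nearly gave up on writing a solution because your sample data has several typos. Nothing is worse than getting the code right only to see the test still fail because you used 'dictionary' in the code and 'definitions' in the data, or 'furtykurtle', 'furtikurty' and 'furtijurty' where it should all be the same word. So please make sure your data is correct before posting. Thankfully I was writing the code as a test.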

#!/usr/bin/perl 

use strict;
use warnings;

use XML::Twig;
use Regexp::Assemble;

use Test::More tests => 1; 
use autodie qw(open);

my %dictionary = (
    frobnitz    => 'definitions.html#frobnitz',
    crulps      => 'definitions.html#crulps',
    furtikurty  => 'definitions.html#furtikurty',
    );

my $match_defs= Regexp::Assemble->new()
                                ->add( keys %dictionary)
                                ->anchor_word
                                ->as_string;
# I am not familiar enough with Regexp::Assemble to know a cleaner
# way to get the capturing parentheses into the regexp
$match_defs= qr/($match_defs)/; 

my $in       = data_para(); 
my $expected = data_para();
my $out;
open( my $out_fh, '>', \$out);


XML::Twig->new( twig_roots => { 'description' => sub { tag_defs( @_, $out_fh, $match_defs, \%dictionary); } },
                twig_print_outside_roots => $out_fh, 
              )
         ->parse( $in);

is( $out, $expected, 'base test');
exit;

sub tag_defs
  { my( $t, $description, $out_fh, $match_defs, $dictionary)= @_;

    my @a= $description->mark( $match_defs, 'a' );

    # word => 1 when already used in this description
    # this might need to have a different scope if you need to tag
    # only the first time the word appears in a section or whatever
    my $tagged_in_description; 

    foreach my $a (@a) 
      { my $word= $a->text;
        warn "checking a: ", $a->sprint, "\n";

        if( $tagged_in_description->{$word})
          { $a->erase; } # we did not need to tag it after all
        else
          { $a->set_att( href => $dictionary->{$word}); }
        $tagged_in_description->{$word}++;
      }

    $t->flush( $out_fh); }


# Look up the href for a word (helper, not called by the code above)
sub def_href
  { my( $word)= @_;
    return $dictionary{$word};
  }

sub data_para
  { local $/="\n\n";
    my $para= <DATA>;
    return $para;
  }

__DATA__
<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b><a href="definitions.html#frobnitz">frobnitz</a></b>, <a href="definitions.html#crulps">crulps</a> and <a href="definitions.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>

Solution 1
3, accepted, 2011-05-12 08:19:00
