How do I most reliably preserve HTML entities when processing HTML documents with Mojo::DOM?

Question

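I am using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I extracted from existing content in the Movable Type content management system.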

I am writing those phrases out to a file so that they can be translated into other languages, as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {

        print_phrase($phrase); # utility function to write out the phrase to a file

    }

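When Mojo::DOM encounters embedded HTML entities (such as &trade; and &nbsp;), it converts them into encoded characters instead of passing them along as written. I want the entities to be passed through exactly as written.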

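I realize that I can use Mojo::Util::decode to pass these HTML entities through to the file I am writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."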

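If that is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.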
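One way to sidestep the encoding test is to attempt the decode and fall back to the original string when it returns undef. The following is only a minimal sketch under that assumption; `to_characters` is a hypothetical helper name, not part of Mojolicious, and the sample strings are made up.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper: return Perl characters whether $raw holds UTF-8 bytes
# or was already decoded somewhere upstream.
sub to_characters {
    my ($raw) = @_;
    return '' unless defined $raw;
    my $chars = decode('UTF-8', $raw);       # undef when $raw is not valid UTF-8
    return defined $chars ? $chars : $raw;   # fall back to the input unchanged
}

my $from_bytes = to_characters("caf\xC3\xA9");    # UTF-8 bytes for "café"
my $from_chars = to_characters("TM \x{2122}");    # already characters (wide char)

print length($from_bytes), "\n";   # prints 4
```

The fallback is not airtight: an already-decoded string whose characters all happen to look like a valid UTF-8 byte sequence would be decoded a second time, so this is a heuristic rather than a real encoding test.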

How do I most reliably preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

Answer 1

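It looks like the XML entities are replaced when you map to text, but they are preserved when you work with the nodes and use their content instead. This minimal example: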

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;

If you want to keep your loop and the map, replace map('text') with map('content'), like this:

for my $phrase ($dom->find('p')->map('content')->each) {
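To make the difference concrete, here is a small side-by-side sketch of `text` and `content` on the same element. It reuses the sample markup from the example above; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
my $p   = $dom->at('p');

my $as_text    = $p->text;      # entities decoded to plain characters
my $as_content = $p->content;   # inner markup serialized again, special characters escaped

print "text:    $as_text\n";      # text:    this & "that"
print "content: $as_content\n";   # content: this &amp; &quot;that&quot;
```

As far as I can tell, `content` escapes the XML-special characters again when it serializes the element, rather than remembering how the source was written. A named entity such as &trade; may therefore still come back as the decoded character, so check this against your own documents.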

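If you have nested tags and want to find only the text (printing the contents of the nested tags but not the tag names themselves), you need to walk the DOM tree: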

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

prints:

this & 
"
that
"
done

Answer 2

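Through testing, my colleague and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, which made it impossible to preserve HTML entities as written. To get around this, we added the following subroutine to double-encode ampersands: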

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    #
    ##########   

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
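A quick round trip shows what the double encoding buys you. This is only a sketch: the sample markup is made up, and `encode_amp` is repeated here in condensed form so the snippet runs on its own.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Same substitution as encode_amp() above, repeated so this sketch is self-contained.
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;   # every "&" becomes "&amp;", including those in entities
    return $text;
}

my $html = '<p>Acme&trade; &amp; Co&nbsp;Ltd</p>';   # made-up sample input

# The parser decodes one level of escaping, which restores the original entities.
my $preserved = Mojo::DOM->new(encode_amp($html))->at('p')->text;

print "$preserved\n";   # Acme&trade; &amp; Co&nbsp;Ltd
```

Note that the substitution also touches ampersands inside attribute values (for example query strings in href), which only matters if you read attributes from the resulting DOM.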

Later in the script, we pass $page->text through encode_amp() when we instantiate a new Mojo::DOM object.

$dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

print FILE "\n\t### Body\n\n";

for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
    map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {

    print_phrase($phrase);

}
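For anyone puzzling over the chained calls, here is the same pipeline on a tiny made-up document, one step per line. The selector list is shortened to 'li, li a' purely for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<ul><li>Read the <a href="/docs">docs</a></li></ul>');

my @phrases = $dom->find('li, li a')       # matches the <li> and the nested <a>
    ->map('descendant_nodes')              # one collection of descendants per match
    ->map('each')                          # flatten into a single collection of nodes
    ->grep(sub { $_->type eq 'text' })     # keep text nodes only
    ->map('content')                       # the text held by each node
    ->uniq                                 # drop repeats caused by overlapping selectors
    ->each;                                # return the result as a plain list

print "[$_]\n" for @phrases;
# [Read the ]
# [docs]
```

Without `uniq`, "docs" would show up twice, because the text node inside the link is a descendant of both the `li` match and the `li a` match.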

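The code block above incorporates the earlier suggestion from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which made a good observation about how Mojo::DOM works.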

This code definitely works for my application.

Answer 1 (score 3, 2019-03-12 22:44:47)

Answer 2 (score 0, accepted, 2019-04-10 02:46:03)
