Extract text from HTML/XML tags in Perl

I have an HTTPS response like this:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>
  • The key names are static, and I need to use a variable to grab specific values.
  • I'm using decode_entities to turn the escaped text into markup.
  • Sometimes a key appears twice in the response, but with the same value.

XML::LibXML doesn't help much here since the response is not a well-formed XML file/string.

I tried to get it with a regex, like this:

sub get_key {
    my $start = '<key name="'.$_[0].'">\n<value>';
    print $_[1];
    my $end = "</value>";
    print " [*] Trying to get $_[0]\n";
    print "Start: $start  --- End $end";
    if($_[1] =~ /\b$start\b(.*?)\b$end\b/s){
        my $result = $1;
        print $result, "\n\n";
        return $result;
    }
}

get_key("string_to_search", $string_from_response);

I need to extract the value that sits inside the matching key element:

<key name="variable">
 <value>Grab me</value>
</key>

Once you've extracted the embedded XML document, you should use a proper XML parser.

use feature     qw( say );
use XML::LibXML qw( );

my $xml_doc = XML::LibXML->new->parse_string($xml);

for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
   my $key = $key_node->getAttribute("name");
   my $val = $key_node->findvalue("value/text()");
   say "$key: $val";
}
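Since the question asks for a lookup by key name held in a variable, the loop above can be narrowed to a single key. The following is only a minimal sketch: get_key is an illustrative helper (not part of XML::LibXML), and the sample XML is assumed to match the document embedded in the question. It returns the first match, which should be enough given that duplicated keys carry the same value.

```perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML;

# Sample XML, assumed to match the document embedded in the question's response.
my $xml = <<'END_XML';
<localconfig>
  <key name="ssl_default">
    <value>sha256</value>
  </key>
  <key name="some variable">
    <value>1024</value>
  </key>
</localconfig>
END_XML

my $xml_doc = XML::LibXML->new->parse_string($xml);

# Illustrative helper: value of the first <key> whose name attribute equals $name.
# The comparison is done in Perl, so $name needs no XPath quoting.
sub get_key {
    my ($doc, $name) = @_;
    for my $key_node ($doc->findnodes('/localconfig/key')) {
        my $attr = $key_node->getAttribute('name');
        next unless defined $attr && $attr eq $name;
        return $key_node->findvalue('value');
    }
    return;    # no such key
}

my $value = get_key($xml_doc, 'some variable');
say "some variable = ", $value // '(not found)';
```

Comparing the attribute in Perl rather than interpolating the name into the XPath expression avoids trouble with names that contain quotes.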

That leaves the question of how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag).

use Encode      qw( encode_utf8 );
use XML::LibXML qw( );

my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );

Option 2: Regex Match

You could probably get away with using a regex pattern match.

use HTML::Entities qw( decode_entities );

my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

use Encode    qw( decode encode_utf8 );
use Mojo::DOM qw( );

my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);    
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );

The problem with Mojo::DOM is that you need to know the encoding of the document before you pass it to the parser (because you must pass it decoded), but you need to parse the document in order to extract its encoding from the document itself.

(Of course, you could use Mojo::DOM to parse the XML too.)
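As a rough sketch of that last remark, the extracted XML can be parsed with Mojo::DOM in XML mode. The sample XML is assumed to match the question's embedded document, and %value_for is an illustrative name. The sample is plain ASCII, so the decoding issue described above does not come up here.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample XML, assumed to match the document embedded in the question's response.
my $xml = <<'END_XML';
<localconfig>
  <key name="ssl_default">
    <value>sha256</value>
  </key>
  <key name="some variable">
    <value>1024</value>
  </key>
</localconfig>
END_XML

# XML mode makes tag and attribute names case-sensitive.
my $xml_dom = Mojo::DOM->new->xml(1)->parse($xml);

# Illustrative lookup table: key name => value text.
my %value_for;
for my $key_el ($xml_dom->find('localconfig > key')->each) {
    my $name     = $key_el->attr('name') // next;
    my $value_el = $key_el->at('value')  // next;
    $value_for{$name} = $value_el->text;
}

say "$_: $value_for{$_}" for sort keys %value_for;
```

A key that appears twice simply overwrites its own hash entry, which is harmless when both copies hold the same value.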


Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.

The hard part of this problem is that the presented document mixes formats: it has a valid HTML structure, but also XML-like elements that appear "tossed in" without a particular pattern. There are ways to disentangle these parts, though they aren't bulletproof and come with trade-offs.

In this case XML::LibXML can do the whole job, as it can deal with bad data, but note the warnings below.

use warnings;
use strict;
use feature 'say';

use Encode qw(encode_utf8); 
use XML::LibXML;

my $html_doc = XML::LibXML->new(recover => 2)->parse_html_fh(\*DATA);
my $xml = encode_utf8( 
    $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r 
);
my $xml_doc = XML::LibXML->new->parse_string($xml);

say for $xml_doc->findnodes('//key');  # node object stringifies

__DATA__
<html>
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>

The parser option recover is what allows the above parsing to go through. From the documentation:

A true value turns on recovery mode which allows one to parse broken XML or HTML data. [...]

As useful as this can be, it of course calls for extreme caution, as we are willfully using bad data (or, rather, non-conforming data here). This case brings two such issues.

  • A regex is needed for the entities. The example deals with those under <pre>, but there may be more. We need to inspect the input and may need code changes for different data.

  • This makes use of the observation that the XML-like "tags" are given by entities (&lt; etc.), which are left as they are during parsing and only decoded later. However ...

  • ... this isn't a rule, and if some tags aren't given that way (but rather as a literal <key>), they can make the library parse the document into a (slightly) different tree. This again requires inspecting the input, and possibly adjusting the code for any new data.

Thanks to ikegami for bringing up the point of first parsing the data and only then dealing with the entities, for a discussion, and for the XML code above. The original version of the XML-related code above decoded first and so ended up with a slightly different tree.

Also note that HTML::TreeBuilder does process this data with ignore_unknown set. The problem then is that these new "tags" (<key> etc.) are just data for it, so any practical use of the obtained tree would probably have to rely on regex.


Another way to deal with this data is with the flexible, high-level HTML parser Marpa::HTML.

A very basic demo:

use warnings;
use strict;
use feature 'say';

use Marpa::HTML qw(html);
use HTML::Entities qw(decode_entities);    

my $input = do { local $/; <DATA> };    
my $html = decode_entities($input);

my (@attrs, @cont);

my $marpa_key = Marpa::HTML::html( 
    \$html,
    {
        'key' => sub {
            push @attrs, Marpa::HTML::attributes();
            push @cont, Marpa::HTML::contents();
        },
    }
);

for my $i (0..$#cont) {
    say "For attribute \"name=$attrs[$i]->{name}\" the <key> has: $cont[$i]"
}

__DATA__
...the same as in the first example, data from the question...

This collects views as it parses, using the API calls attributes and contents, for the <key> element.

It may in principle be suitable for your problem since it accepts the mere semantics of <...> as an element. But those elements aren't treated as XML, which may be a downside if your data relies on XML more than shown. And, of course, this is a different approach with its own rules.

Note that the basic logic and use of the module is that each coderef returns a value, and this return value is used for the element that it fired on; the rest of the text is unchanged. So this is natural for changing particular elements of a document.

I've used it differently above, only to collect information about the "tags". That code prints:

For attribute "name=ssl_default" the <key> has: 
    <value>sha256</value>

For attribute "name=some variable" the <key> has: 
    <value>1024</value>

Notice: technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site's URL or the original address.

 