How can I extract the URL, title, and link text from HTML links in Perl?

Question

I have a page which contains this:

<a href="http://www.trial.com" title="yellow">Trial</a>
<a href="http://www.trial1.com" title="red">Trial2</a>

How can I get the anchor text, URL and title?

I want to have this output:

Trial, http://www.trial.com, yellow
Trial2, http://www.trial1.com, red

I have tried to use WWW::Mechanize, as also explained here, but I do not know how to get the title this way. Do you have any ideas?

Answer 1

The simple version, based on what your question gives:

a page that looks like yours (so no obscure HTML that could trip up the parser)
the desired output

This might be what you are looking for:

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('file:page.html');

foreach my $link ($mech->links) {
    my $text  = $link->text;
    my $url   = $link->url;
    my $title = $link->attrs->{title} // '';    # empty string when the link has no title attribute

    print "$text, $url, $title\n";
}

Happy coding, TIMTOWTDI
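
In the spirit of TIMTOWTDI, here is a minimal sketch of another way to get the same three fields, using HTML::TreeBuilder on an HTML string instead of WWW::Mechanize. This is a swapped-in technique, not the approach of the answer above; the helper name link_lines is hypothetical, and the sketch assumes the markup is as simple as in the question.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: return one "text, url, title" line
# for each <a> element in $html that has an href attribute.
sub link_lines {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @lines;
    for my $anchor ($tree->look_down(_tag => 'a')) {
        my $url = $anchor->attr('href');
        next unless defined $url;    # skip anchors without an href

        # A missing title becomes an empty string instead of undef.
        my $title = $anchor->attr('title') // '';
        push @lines, join(', ', $anchor->as_text, $url, $title);
    }

    $tree->delete;    # release the parse tree
    return @lines;
}

# The two links from the question.
my $html = <<'HTML';
<a href="http://www.trial.com" title="yellow">Trial</a>
<a href="http://www.trial1.com" title="red">Trial2</a>
HTML

print "$_\n" for link_lines($html);
```

Because the HTML is passed in as a string, this variant needs neither a local file nor a network request, which can make it easier to try out on a snippet.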

Answer 2

I put this together using the documentation linked in your question, and I believe it solves your problem. Running it against https://www.perlmonks.org produces some outliers, because some of the URLs there are relative rather than full URLs, but with a simple check that skips the links you do not want, you should get what you need.

Example output:

_____________________________________________________________________________________________________________________________
| Text                                            | URL                | Attributes
_____________________________________________________________________________________________________________________________
| Testing a metacpan dist with XS components      | ?node_id=1216149   | [name]post-head-id1216149[id]post-head-id1216149,  |
| Controlling the count in array                  | ?node_id=1216134   | [name]post-head-id1216134[id]post-head-id1216134,  |

You were probably stumped by the hashref. You just need a for loop over its keys to get the attribute names and their values.

Code:
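
As a minimal sketch of that loop in isolation, the following walks a plain hashref of attributes and builds the "[name]value" string. The helper name format_attrs and the sample hash are hypothetical; with WWW::Mechanize the hashref would come from $link->attrs. Unlike a pattern match on "href", it skips only the attribute named exactly href, and it sorts the keys so the output order is stable.

```perl
use strict;
use warnings;

# Hypothetical helper: build a "[name]value" string from an
# attribute hashref, leaving out the href attribute.
sub format_attrs {
    my ($attrs) = @_;
    my $out = '';
    for my $name (sort keys %{$attrs}) {
        next if lc $name eq 'href';              # the URL is reported separately
        my $value = $attrs->{$name} // '';       # tolerate attributes without a value
        $out .= "[$name]$value";
    }
    return $out;
}

# Sample attributes, made up for illustration.
my %attrs = (
    href => '?node_id=1216149',
    name => 'post-head-id1216149',
    id   => 'post-head-id1216149',
);

print format_attrs(\%attrs), "\n";
```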

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize ();

# Disables TLS hostname verification; convenient for a quick test, unsafe for real use.
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;

my @urls = (
    q{https://www.perlmonks.org/}
);

my $mech = WWW::Mechanize->new();
$mech->get(@urls);

my @links = $mech->links();
print qq{______________________________________________________________________________________________________________\n};
printf(qq{| %-65s | %-75s | %-25s\n},q{Text},q{URL},q{Attributes});
print qq{______________________________________________________________________________________________________________\n};
foreach my $link (@links) {
    if ($link->text() && $link->url()) {
        # Avoid naming this $a: that name is reserved for sort blocks.
        my $attr_text = '';
        foreach my $attr (keys %{$link->attrs()}) {
            next if $attr =~ m/href/i;

            # $link->attrs()->{$attr} is the value stored under this key in the hashref.
            $attr_text .= qq{[$attr]} . $link->attrs()->{$attr};
        }
        my $info;
        if ($attr_text) {
            $info = sprintf(qq{| %-65s | %-75s | %-25s, },$link->text(),$link->url(),$attr_text);
        } else {
            $info = sprintf(qq{| %-65s | %-75s |},$link->text(),$link->url());
        }
        print $info . qq{ |\n};
    }
}
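
This answer notes that some of the URLs are not full URLs and suggests a simple check. One possible way to do that, sketched here under assumptions, is to resolve each href against the page address with the URI module and keep only http and https results. The helper name absolute_http_url is hypothetical, and the base URL is the one used in the script above.

```perl
use strict;
use warnings;

use URI;

# Hypothetical helper: resolve $href against $base and return the
# absolute URL as a string, or nothing if it is not http/https.
sub absolute_http_url {
    my ($href, $base) = @_;
    my $uri    = URI->new_abs($href, $base);
    my $scheme = $uri->scheme // '';
    return unless $scheme eq 'http' || $scheme eq 'https';
    return $uri->as_string;
}

my $base = 'https://www.perlmonks.org/';
for my $href ('?node_id=1216149', 'mailto:someone@example.com') {
    my $abs = absolute_http_url($href, $base);
    print defined $abs ? "$abs\n" : "skipped: $href\n";
}
```

When the links come from WWW::Mechanize, the link object's url_abs method should give a comparable absolute URL without the explicit base.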

Answer 1
score 2, accepted, 2018-06-09 09:44:07

Answer 2
score 0, 2018-06-09 03:55:09
