Get HTML within an <a> element using WWW::Mechanize

Question

I'm extracting specific links from an HTML page using WWW::Mechanize.

my $mech = WWW::Mechanize->new();

$mech->get( $uri );

my @links = $mech->find_all_links(url_regex => qr/cgi-bin/);

for my $link ( @links ) {
    # try to get everything between <a href="[...]">HERE</a>
}

The links look like this:

<a href="[...]"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>

$link->text gives me foo bar I WANT THIS TEXT, with no way of telling which part of it came from inside the <span> element.

Is there any way to get the raw HTML code instead of the stripped text?

In other words I need to find a way to only get I WANT THIS TEXT without knowing the exact text within the <span> tag.

Answer 1

As simbabque has said, you can't do that with WWW::Mechanize.

In fact there's very little point in using WWW::Mechanize if you don't want any of its features. If all you're using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with lots of additional stuff that you don't need here.

Here's an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I've used HTML::TreeBuilder because it's pretty good at tolerating malformed HTML in a way similar to modern browsers.

I've been unable to test it as you haven't provided proper sample data and I'm not inclined to create my own.

use strict;
use warnings 'all';
use feature 'say';

use WWW::Mechanize;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');

my $tree = HTML::TreeBuilder->new_from_content($mech->content);

for my $link ( @{ $tree->extract_links('a') } ) {

    my ($href, $elem, $attr, $tag) = @$link;

    # Exclude non-CGI links (match the URL, not the array reference)
    next unless $href =~ /cgi-bin/;

    # Find all immediate child text nodes and concatenate them
    # References are non-text children
    # Note: text nested inside child elements of the <a> is not picked up here
    my $text = join ' ', grep { not ref } $elem->content_list;
    next unless $text =~ /\S/;

    # Trim and consolidate spaces
    $text =~ s/\A\s+|\s+\z//g;
    $text =~ s/\s+/ /g;

    say $text;
}
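
The loop above only looks at text directly under the <a> element, whereas in the question's sample the wanted text sits two <div> levels down, next to the <span>. Here is a rough sketch of one way to handle that, assuming the wanted text is always a sibling of the first <span> inside the link. The helper name link_details and the sample href are made up for illustration. The inner HTML is rebuilt from the parse tree, so it may not match the original source byte for byte.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder;
use HTML::Entities;

# Hypothetical helper: returns one { html, text } hash per matching link
sub link_details {
    my ($html, $url_re) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @found;

    for my $elem ( $tree->look_down(_tag => 'a', href => $url_re) ) {

        # Inner HTML: serialise element children, re-escape text children
        my $inner = join '',
            map { ref $_ ? $_->as_HTML('<>&', undef, {}) : encode_entities($_, '<>&') }
            $elem->content_list;
        $inner =~ s/\s+\z//;

        # Locate the first <span>, then collect the text nodes beside it
        my $span = $elem->look_down(_tag => 'span') or next;
        my $text = join ' ', grep { not ref } $span->parent->content_list;

        # Trim and consolidate spaces
        $text =~ s/\A\s+|\s+\z//g;
        $text =~ s/\s+/ /g;

        push @found, { html => $inner, text => $text };
    }

    $tree->delete;    # release the parse tree
    return @found;
}

# Sample markup modelled on the question; the href is a placeholder
my $sample = '<a href="/cgi-bin/demo"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>';

for my $link ( link_details($sample, qr/cgi-bin/) ) {
    say $link->{html};
    say $link->{text};
}
```

Going through the <span>'s parent means the span's own content never has to be known in advance. Links without a <span> are skipped in this sketch.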

1 answers

solution1
2 2017-06-20 15:15:33
