
WWW::Mechanize is having trouble regexing for HTML entities

I'm attempting to grab the first (rule) link on this page (and other similar pages). The markup for that link is rather crusty by web 2.0 standards:

<A HREF="readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=1&pt=1&ch=3&rl=1" 
   NAME="&sect;3.1">
  <font SIZE="4">&sect;3.1</font>
</A>

I'm using the following perl code to attempt to snatch it:

my $rule = $mech->find_link(name_regex => qr/^&sect;\d+/i);

The only thing I can figure is that the upper-cased attributes are confusing it... Other trivial tests on better-written markup let me use any of the link methods in Mechanize. For a few minutes it had me scratching my head, wondering if I needed to escape the ampersand. Then I switched to text_regex... but I think the shitty font tag is messing that up.

So I have two related questions: First, is WWW::Mechanize::Link written in such a way that these attributes are assumed to be lowercase (and how is it seeing the href then?). Second, is there some obvious workaround that I'm missing here?

[edit] Hobbs was correct: this is a case of Mechanize converting the entities back to the actual character; you don't get the raw source HTML. The literal character probably didn't work because of some character encoding issue (it's Perl, after all), but the hex escape did the trick. If you want to field an answer, Hobbs, I'll mark it as the correct one.
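For the record, a minimal sketch of the working call as described above (the \xA7 hex escape is U+00A7, the section sign, the same escape used in the answers below):

my $rule = $mech->find_link( name_regex => qr/^\xA7\d+/i );   # match the decoded §, not the literal "&sect;"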

Still trying to figure out what's wrong with Stackoverflow that I'm being downvoted for a question regarding something you have to be slightly clever with... I guess everyone's just trying to score points or do the Wikipedia exclusionist thing. Might be time to stop asking questions here altogether.

It is not immediately clear to me why the search on the name attribute fails; it fails for me, too.

However, the end of the URL in these links (for "rules") identifies them just as the name attribute does, and we can pick them up with url_regex => qr/rl=\d+$/.

use warnings 'all';
use strict;    

use WWW::Mechanize;

my $url = 'your url';

my $mech = WWW::Mechanize->new();

$mech->get( $url );   # dies on failure, since autocheck is enabled by default

my @all_rules = $mech->find_all_links( url_regex => qr/rl=\d+$/ );

# call url() outside the quotes -- method calls don't interpolate in strings
print $_->url(), "\n" for @all_rules;

This prints the URLs for all ten rules (with the URL from your link).

Another avenue would be to get the table (the second one on the page) and thus separate that content from the rest, then pry the links out as the sole content of their <td> cells. The HTML::TableExtract module can help with that, and it also bypasses the encoding problem since we can get the raw HTML (tested).
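A minimal sketch of that approach, assuming the rule links sit in the second table (count => 1) and pulling the href out of each cell's raw HTML with a deliberately naive regex (both details are my assumptions, not taken from the answer above):

use strict;
use warnings 'all';

use WWW::Mechanize;
use HTML::TableExtract;

my $mech = WWW::Mechanize->new();
$mech->get( 'your url' );

# keep_html => 1 leaves the raw markup in each cell, so &sect; stays untouched;
# count => 1 selects the second table in the document (tables are zero-indexed)
my $te = HTML::TableExtract->new( count => 1, keep_html => 1 );
$te->parse( $mech->content );

for my $row ( $te->rows ) {
    for my $cell ( grep { defined } @$row ) {
        print "$1\n" while $cell =~ /href="([^"]+)"/gi;
    }
}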


Update

This answer clearly takes a different approach, searching by the URL of the link rather than by its name.

Since it was explained that there are "4000 separate pages" and the URLs aren't as reliable, the best way may well be to go for the name attribute, just as asked. In this case one should account for the whole entity &sect; being decoded by Mechanize, as shown in other answers (and in an early comment by hobbs).

A comment on the reasoning behind this answer. A search for an HTML entity in the name attribute rests entirely on one particular character, so a more rounded approach is useful to have as well. For example, one may identify patterns in the sought content (in this case the text of the links, sketched below), or use the document structure to zero in on the content (in this case the table).
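For instance, a sketch of matching on the link text rather than the name attribute; the pattern is my assumption, and it relies on Mechanize decoding &sect; in the text just as in the attribute (the <font> wrapper should be transparent to the extracted text, though that is worth verifying):

# match the decoded link text, e.g. "§3.1", via the hex escape for §
my @rules = $mech->find_all_links( text_regex => qr/^\xA7[\d.]+$/ );
print $_->url(), "\n" for @rules;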

You have &sect;, which is simply § in an HTML document, not the literal string &sect;. Therefore, your pattern has to take that into account.

I am going to go out on a limb and guess that name_regex => qr/^\xa7\d+/ will do the trick.

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get( 'http://texreg.sos.state.tx.us/public/readtac$ext.ViewTAC?tac_view=5&ti=16&pt=1&ch=1&sch=A&rl=Y' );

my $n = 1;

# the n parameter asks find_link for the 1st, 2nd, 3rd, ... matching link in turn
while (my $link = $mech->find_link(name_regex => qr/^\xA7\d+/, n => $n++)) {
    print $link->url, "\n";
}

Output:

readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=1
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=2
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=3
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=4
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=5
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=6
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=7
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=8
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=9
readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=10

PS: Next time, please post a short, self-contained script which others can run by copying and pasting instead of fragments.

I don't know what you're doing, but this code finds a link:

use strict;
use warnings;
use utf8;   # so the literal § in the pattern below is read as a single character

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

my $url = 'http://texreg.sos.state.tx.us/public/readtac$ext.ViewTAC?tac_view=5&ti=16&pt=1&ch=1&sch=A&rl=Y';

$mech->get($url);

if (my $link = $mech->find_link(name_regex => qr/^§\d/)) {
    print "Found ", $link->url, "\n";
} else {
    print "Not found\n";
}

Output:

Found readtac$ext.TacPage?sl=R&app=9&p_dir=&p_rloc=&p_tloc=&p_ploc=&pg=1&p_tac=&ti=16&pt=1&ch=1&rl=1

As others have said, the problem is that WWW::Mechanize expands the entity &sect; in the attribute's value, so you should be looking for a "section sign" character § instead.

I much prefer the name silcrow because it pairs with the name for the paragraph sign, pilcrow. But, sadly, the Unicode Consortium has heard only of the latter.

The data from the site is UTF-8-encoded, so you must use that character in the same encoding for it to match. If you use a literal § in your program without a preceding use utf8, the character will appear as the two-byte sequence "\xC2\xA7", which also will not match.

The obvious ways around it are:

  • To define the character by its Unicode name, which I think is best because it is the clearest and allows the whole program to be written in ASCII. The pattern would be qr/^\N{SECTION SIGN}[\d.]+$/

  • To put use utf8 at the top of your code, and then use a pattern of qr/^§[\d.]+$/

  • To specify the code point in the regex pattern: qr/^\xA7[\d.]+$/ . This is the least readable of all, because few people have any idea what Unicode character U+00A7 is (a short sketch comparing all three forms follows this list)
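Here is a small sketch of my own, not part of the original answer, checking that the three pattern styles above all match a decoded name such as "§3.1":

#!/usr/bin/env perl
use strict;
use warnings 'all';
use utf8;                          # needed only for the literal-character form

my $name = "\N{SECTION SIGN}3.1";  # what Mechanize sees after decoding &sect;3.1

for my $re ( qr/^\N{SECTION SIGN}[\d.]+$/, qr/^§[\d.]+$/, qr/^\xA7[\d.]+$/ ) {
    print $name =~ $re ? "match\n" : "no match\n";
}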

Here's a program that extracts the ten links from the URL that you give in your question

use strict;
use warnings 'all';

use WWW::Mechanize;

use constant URL => 'http://texreg.sos.state.tx.us/public/readtac$ext.ViewTAC?tac_view=5&ti=16&pt=1&ch=1&sch=A&rl=Y';

my $mech = WWW::Mechanize->new;

$mech->get(URL);

my $rule = $mech->find_all_links(name_regex => qr/^\N{SECTION SIGN}[\d.]+$/);

printf "%d matches\n", scalar @$rule;

output

10 matches



Update

Yet another way, which you may prefer, is to use the HTML::Entities module to translate exactly the text that is used in the HTML

This example defines the variable $sect with

my $sect = decode_entities('&sect;')

and then uses it in a regex pattern qr/^$sect[\d.]+$/

It produces exactly the same output as the code sample above

use strict;
use warnings 'all';

use WWW::Mechanize;
use HTML::Entities 'decode_entities';

use constant URL => 'http://texreg.sos.state.tx.us/public/readtac$ext.ViewTAC?tac_view=5&ti=16&pt=1&ch=1&sch=A&rl=Y';

my $mech = WWW::Mechanize->new;

$mech->get(URL);

my $sect = decode_entities('&sect;');
my $rule = $mech->find_all_links(name_regex => qr/^$sect[\d.]+$/);

printf "%d matches\n", scalar @$rule;
