I would like to dump all the names on this page and all the remaining 146 pages.
The red/orange previous/next buttons seem to use JavaScript and fetch the names via AJAX.
Question
Is it possible to write a script to crawl the 146 pages and dump the names?
Are there Perl modules for this kind of thing?
You can use WWW::Mechanize or another crawler for this. Web::Scraper might also be a good idea.
use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dump;    # exports dd()

# First, create your scraper block
my $scraper = scraper {
    # Grab the text of all elements with class type_firstname
    # (that way you could also classify them by type)
    process ".type_firstname", "list[]" => 'TEXT';
};

my $base = "http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/";
my @names;
foreach my $page ( 1 .. 146 ) {
    # Fetch the page (add the page number parameter)
    my $res = $scraper->scrape( URI->new( $base . "?tx_lfnamelists_pi2[gotopage]=" . $page ) );
    # Add the names to our list; fall back to an empty list if nothing matched
    push @names, @{ $res->{list} || [] };
}

dd \@names;
This gives you a very long list of all the names. Running it may take some time, so try it with 1 .. 1 first.
In general, try using WWW::Mechanize::Firefox, which essentially remote-controls Firefox.
For that particular page, though, you can use something as simple as HTTP::Tiny.
Just make POST requests to the URL, passing the parameter tx_lfnamelists_pi2[gotopage] with values from 1 to 146.
Example at http://hackst.com/#4sslc for page #30.
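In case that link goes away, here is a minimal sketch of the POST approach using only HTTP::Tiny. The URL and the parameter name come from this thread; the helper names form_for_page and fetch_page are made up for illustration, and the script only prints the size of each page rather than parsing out the names.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# URL as given in this thread; it may no longer be live.
my $URL = 'http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/';

# Build the form data for one results page (hypothetical helper).
sub form_for_page {
    my ($page) = @_;
    return { 'tx_lfnamelists_pi2[gotopage]' => $page };
}

# POST the form and return the raw HTML; die on any HTTP failure.
sub fetch_page {
    my ( $http, $page ) = @_;
    my $res = $http->post_form( $URL, form_for_page($page) );
    die "Page $page failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

# Page numbers come from the command line, so nothing is fetched
# unless you ask for it.
my $http = HTTP::Tiny->new( timeout => 30 );
for my $page (@ARGV) {
    my $html = fetch_page( $http, $page );
    printf "page %s: %d bytes\n", $page, length $html;
    sleep 1;    # be polite to the server
}
```

Saved under a name of your choosing (say dump_pages.pl), running it with a single page number such as 30 is a quick check before looping over 1 to 146. The returned HTML still needs to be parsed to get the names out, for example with Web::Scraper.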
Moral of the story: always look in Chrome's Network tab and see what requests the web page makes.