
Possible to dump AJAX content from webpage?

I would like to dump all the names on this page and all the remaining 146 pages.

The red/orange previous/next buttons seem to use JavaScript, and the names are fetched via AJAX.

Question

Is it possible to write a script to crawl the 146 pages and dump the names?

Are there Perl modules for this kind of thing?

You can use WWW::Mechanize or another crawler for this. Web::Scraper might also be a good idea:

use Web::Scraper;
use URI;
use Data::Dump;

# First, create your scraper block
my $scraper = scraper {
    # grab the text nodes from all elements with class type_firstname (that way you could also classify them by type)
    process ".type_firstname", "list[]" => 'TEXT';
};

my @names;
foreach my $page (1 .. 146) {
    # Fetch the page (add the page number parameter)
    my $res = $scraper->scrape( URI->new("http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/?tx_lfnamelists_pi2[gotopage]=" . $page) );

    # Add the scraped names to our list
    push @names, @{ $res->{list} };
}

dd \@names;

It will give you a very long list of all the names. Running it may take some time, so try with 1..1 first.

In general, try using WWW::Mechanize::Firefox which will essentially remote-control Firefox.
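A minimal sketch of that approach (this assumes Firefox is running with the MozRepl add-on, which WWW::Mechanize::Firefox connects to; the URL is the listing page from the question):

use WWW::Mechanize::Firefox;

# Connect to a running Firefox instance via MozRepl
my $mech = WWW::Mechanize::Firefox->new();

# Load the page inside the real browser, so JavaScript/AJAX runs
# and the rendered content includes the dynamically loaded names
$mech->get('http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/');

# Dump the fully rendered HTML for further scraping
print $mech->content;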

For that particular page, though, you can use something as simple as HTTP::Tiny.

Just make POST requests to the URL and pass the parameter tx_lfnamelists_pi2[gotopage] with values from 1 to 146.

Example at http://hackst.com/#4sslc for page #30.
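A minimal sketch of that approach using HTTP::Tiny's post_form (the form field name is taken from the description above; extracting the names from the returned HTML would still need Web::Scraper or similar):

use HTTP::Tiny;

my $http = HTTP::Tiny->new;
my $url  = 'http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/';

foreach my $page (1 .. 146) {
    # POST the page number as a form parameter
    my $res = $http->post_form($url, { 'tx_lfnamelists_pi2[gotopage]' => $page });
    die "Request for page $page failed: $res->{status}" unless $res->{success};

    # $res->{content} now holds the HTML for that page;
    # parse the names out of it, e.g. with Web::Scraper as shown above
    print length($res->{content}), " bytes for page $page\n";
}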

Moral of the story: always look in Chrome's Network tab and see what requests the web page makes.
