
Perl WWW::Mechanize Web Spider. How to find all links

I am currently attempting to create a Perl web spider using WWW::Mechanize.

What I am trying to do is create a web spider that will crawl the whole site at the URL entered by the user and extract all of the links from every page on the site.

What I have so far:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

my $urlToSpider = $ARGV[0];
$mech->get($urlToSpider);

print "\nThe url that will be spidered is $urlToSpider\n";

print "\nThe links found on the url's starting page\n";

my @foundLinks = $mech->find_all_links();

# Prepend the starting URL to relative links such as /contact-us
foreach my $linkList (@foundLinks) {

    unless ($linkList->[0] =~ /^https?:\/\//i) {

        $linkList->[0] = "$urlToSpider" . $linkList->[0];
    }

    print "$linkList->[0]";
    print "\n";
}

What it does:

1. At present it will extract and list all links on the starting page.

2. If a link it finds is in /contact-us or /help format, it will add 'http://www.thestartingurl.com' to the front of it, so it becomes 'http://www.thestartingurl.com/contact-us'.

The problem:

At the moment it also finds links to external sites, which I do not want it to do. For example, if I want to spider 'http://www.tree.com', it will find links such as http://www.tree.com/find-us, but it will also find links to other sites like http://www.hotwire.com.

How do I stop it from finding these external URLs?

After finding all the URLs on the page, I also want to save this new list of internal-only links to a new array called @internalLinks, but I cannot seem to get it working.

Any help is much appreciated, thanks in advance.

This should do the trick:

my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

If you don't want CSS links, try:

my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/, tag => 'a');

Also, the regex you're using to add the domain to any relative links can be replaced with:

print $linkList->url_abs();
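
Putting those two suggestions together, here is a minimal sketch of how the original script could be rewritten, assuming the starting URL is still passed as the first command-line argument. The \Q...\E in the regex escapes any metacharacters (such as the dots) in the starting URL.

use strict;
use warnings;
use WWW::Mechanize;

my $urlToSpider = $ARGV[0];

my $mech = WWW::Mechanize->new();
$mech->get($urlToSpider);

# Keep only <a> links whose absolute URL starts with the starting URL
my @internalLinks = $mech->find_all_links(
    url_abs_regex => qr/^\Q$urlToSpider\E/,
    tag           => 'a',
);

# url_abs() resolves relative links against the current page,
# so there is no need to prepend the domain by hand
print $_->url_abs(), "\n" for @internalLinks;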
