I'm new to all of this, so please help. I'm trying to extract every
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
in a web page. I want to capture the /v/name/idlike123123ksajdfk part. (The
<div class="name"><a href="/v/
part is fixed.) So I wrote this regular expression (it may make you laugh):
~m#<div class="name"><a href="(/v/.*?)">#
It would be very helpful if you could correct my code.
Using a robust HTML parser (see http://htmlparsing.com/ for why):
use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
HTML
my @v_links = $w->find('div.name > a[href^="/v/"]')->attr('href');
There are plenty of Perl modules that extract links from HTML. WWW::Mechanize, Mojo::DOM, HTML::LinkExtor, and HTML::SimpleLinkExtor can all do it.
Web scraping with Mojolicious is probably the simplest way to do it in Perl nowadays:
http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
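For illustration, here is a minimal sketch of the same extraction with Mojo::DOM, assuming Mojolicious is installed. The inline HTML is made-up sample input (the second and third rows are hypothetical), and the selector is the same one used in the Web::Query answer.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input; in practice this would be the fetched page body.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/another456">other</a></div>
<div class="other"><a href="/x/ignored">skip</a></div>
HTML

# Select anchors that are direct children of div.name and whose href
# starts with /v/, then collect the href attribute of each match.
my @v_links = $dom->find('div.name > a[href^="/v/"]')
                  ->map(attr => 'href')
                  ->each;

print "$_\n" for @v_links;
```

Because the selector works on the parsed tree, it should keep matching if the attribute order or whitespace in the markup changes, which is where a regex tends to break.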
You should not use a regex to parse HTML, as there are many libraries built for that job.
Daxim's answer is a good example.
However, if you want to use a regex anyway and your text is in $_, then
my @list = m{<div class="name"><a href="(/v/.*?)">}g;
will get you a list of all findings.
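If the text is in a named variable rather than $_, the match has to be bound with =~. A small self-contained sketch using only core Perl (the variable names and the sample HTML, including the second and third rows, are just assumptions for the demo):

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the fetched page body.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/another456">other</a></div>
<div class="other"><a href="/x/ignored">skip</a></div>
HTML

# =~ binds the match to $html; with /g in list context the match
# returns every captured group, one per occurrence.
my @list = $html =~ m{<div class="name"><a href="(/v/.*?)">}g;

print "$_\n" for @list;
```

Note that this only works while the markup looks exactly like the pattern; an extra attribute or a different quoting style will make it miss links silently.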