
Please help me define a Perl regular expression

I'm new to all of this. Please help. I'm trying to extract every

<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>

in a web page. I want to capture the /v/name/idlike123123ksajdfk part (knowing that the

<div class="name"><a href="/v/

part is fixed). So I wrote this regular expression (it may make you laugh):

~m#<div class="name"><a href="(/v/.*?)">#

It would be very helpful if you could correct my stupid code.

Using a robust HTML parser (see http://htmlparsing.com/ for why):

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
HTML

my @v_links = $w->find('div.name > a[href^="/v/"]')->attr('href');

There are plenty of Perl modules that extract links from HTML. WWW::Mechanize, Mojo::DOM, HTML::LinkExtor, and HTML::SimpleLinkExtor can all do it.
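As a rough illustration of one of these modules, here is a minimal sketch using HTML::LinkExtor. The sample HTML and the variable names are made up for the example. Note that HTML::LinkExtor reports every link-carrying attribute in the document and knows nothing about the enclosing div, so this version only filters on the tag name and the /v/ prefix; it does not check for class="name".

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample input, invented for this sketch.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="other"><a href="/about">about</a></div>
<img src="/v/logo.png">
HTML

my @v_links;

# The callback receives the tag name followed by attribute => value pairs
# for every attribute of that tag that holds a link.
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a';                  # skip <img>, <link>, ...
    return unless defined $attr{href};
    push @v_links, $attr{href} if $attr{href} =~ m{^/v/};
});

$extor->parse($html);
$extor->eof;

print "$_\n" for @v_links;
```

If the surrounding div matters, a selector-based parser is a better fit than this callback approach.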

Web scraping with Mojolicious is probably the simplest way to do it in Perl nowadays:

http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
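For the markup in the question, a minimal sketch with Mojo::DOM (the HTML parser that ships with Mojolicious) might look like the following. The sample HTML is invented for the example; for a live page you would fetch the document first, for instance with Mojo::UserAgent.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample input, invented for this sketch.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/another456">other</a></div>
<div class="misc"><a href="/x/ignored">skip</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# CSS selector: <a> elements that are direct children of div.name
# and whose href starts with /v/.
my @v_links = $dom->find('div.name > a[href^="/v/"]')
                  ->map(attr => 'href')   # call ->attr('href') on each element
                  ->each;                 # flatten the collection into a list

print "$_\n" for @v_links;
```

The selector carries the whole condition, so nothing depends on attribute order or whitespace in the source HTML.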

You should not use a regex to parse HTML, as there are many libraries written for exactly that.

Daxim's answer is a good example.


However, if you want to use a regex anyway and you have your text assigned to $_, then

my @list = m{<div class="name"><a href="(/v/.*?)">}g;

will give you a list of all matches.
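If the text lives in a variable other than $_, bind the match with =~ (written ~m#...# as in the question, Perl instead applies bitwise negation to the result of a match against $_). Here is a self-contained sketch; the sample HTML and variable names are invented for the example, and [^"]* is swapped in for .*? so the capture cannot run past the closing quote.

```perl
use strict;
use warnings;

# Sample input, invented for this sketch.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/another456">other</a></div>
<div class="misc"><a href="/x/ignored">skip</a></div>
HTML

# In list context with /g, the match returns every captured group.
my @list = $html =~ m{<div class="name"><a href="(/v/[^"]*)">}g;

print "$_\n" for @list;
# prints:
# /v/name/idlike123123ksajdfk
# /v/name/another456
```

This still breaks as soon as the page changes attribute order, quoting, or whitespace, which is why the parser-based answers are the safer route.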
