Please help me define a Perl regular expression

I'm new to all of this, so please help. I'm trying to crawl every

<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>

in a webpage. I want to capture the /v/name/idlike123123ksajdfk part. (Knowing that the

<div class="name"><a href="/v/

part is fixed.) So I wrote this regular expression (it may make you laugh):

~m#<div class="name"><a href="(/v/.*?)">#

It would be very helpful if you could correct my code.

Use a robust HTML parser (see http://htmlparsing.com/ for why):

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
HTML

my @v_links = $w->find('div.name > a[href^="/v/"]')->attr('href');

There are plenty of Perl modules that extract links from HTML. WWW::Mechanize, Mojo::DOM, HTML::LinkExtor, and HTML::SimpleLinkExtor can all do it.

Web scraping with Mojolicious is probably the simplest way to do it in Perl nowadays:

http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
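As a rough illustration of the Mojolicious approach, here is a minimal sketch using Mojo::DOM on an inline HTML string. The sample markup (including the two non-matching rows) and the variable names are made up for the example, and it assumes the Mojolicious distribution is installed. For a live page you would fetch the document first, for instance with Mojo::UserAgent.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup, invented for this example: only the first row should match.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="other"><a href="/v/name/skipme">other</a></div>
<div class="name"><a href="/x/name/skipme">name</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# Select <a> elements that are direct children of div.name and whose
# href starts with "/v/", then collect the href attribute of each one.
my @v_links = $dom->find('div.name > a[href^="/v/"]')
                  ->map(attr => 'href')
                  ->each;

print "$_\n" for @v_links;
```

The CSS selector is the same one used in the Web::Query answer, so the two approaches should pick out the same elements.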

You should not use a regex to parse HTML, as there are many libraries built for that job.

Daxim's answer is a good example.


However, if you want to use a regex anyway and your text is assigned to $_, then

my @list = m{<div class="name"><a href="(/v/.*?)">}g;

will give you a list of all matches.
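For completeness, here is a small self-contained sketch of that list-context match. It differs from the one-liner in two ways, both my own choices rather than part of the answer: the text lives in a hypothetical $html variable and is bound with =~ instead of relying on $_, and the capture uses [^"]+ in place of .*? so it cannot run past the closing quote. The sample markup is invented for the example.

```perl
use strict;
use warnings;

# Sample markup, invented for this example: the third row has a different
# class and should not be matched.
my $html = <<'HTML';
<div class="name"><a href="/v/name/idlike123123ksajdfk">first</a></div>
<div class="name"><a href="/v/name/another456">second</a></div>
<div class="other"><a href="/v/name/ignored">third</a></div>
HTML

# With /g in list context, the match returns every captured group,
# one entry per occurrence of the pattern.
my @list = $html =~ m{<div class="name"><a href="(/v/[^"]+)">}g;

print "$_\n" for @list;
```

This only works while the markup keeps exactly this shape. An extra attribute, different quoting, or whitespace between the tags would make the pattern miss, which is why the parser-based answers are the safer choice.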
