How can I extract the text from links on an HTML page in Perl?

I am not an expert in Perl, but I have written a Perl script that parses an HTML page and filters for all href attributes. The output is shown below:

href="?Name">Name</a>
href="?Desc">Hourly Details</a>
href="/24x7/2012/11-November/">Data
href="./00:00:00/">00:00:00/</a>
href="./01:00:00/">01:00:00/</a>
href="./02:00:00/">02:00:00/</a>
href="./03:00:00/">03:00:00/</a>
href="./04:00:00/">04:00:00/</a>
href="./05:00:00/">05:00:00/</a>
href="./06:00:00/">06:00:00/</a>
href="./07:00:00/">07:00:00/</a>
href="./08:00:00/">08:00:00/</a>
href="./09:00:00/">09:00:00/</a>
href="./10:00:00/">10:00:00/</a>
href="./11:00:00/">11:00:00/</a>
href="./12:00:00/">12:00:00/</a>
href="./13:00:00/">13:00:00/</a>
href="./14:00:00/">14:00:00/</a>
href="./15:00:00/">15:00:00/</a>
href="./16:00:00/">16:00:00/</a>
href="./17:00:00/">17:00:00/</a>
href="./18:00:00/">18:00:00/</a>
href="./19:00:00/">19:00:00/</a>
href="./20:00:00/">20:00:00/</a>
href="./21:00:00/">21:00:00/</a>
href="./22:00:00/">22:00:00/</a>
href="./23:00:00/">23:00:00/</a>

Now I want to extract the values inside the href attributes from "00:00:00" through "23:00:00" while excluding the others. Each extracted value would then be appended to a URL string:

http://x.download.com/00:00:00
------URL------------/..href../
..............................
http://x.download.com/23:00:00

However, the code below:

foreach (@tag) {
    if (m/href/) {
        if ($_ =~ /"\/24/ && $_ =~ /"\/[0-9]/) {
            my $href  = $_;
            my $start = index($href, "\"");
            my $end   = rindex($href, "\"");
            my $link  = substr($href, $start + 1, $end - $start - 1);
            print "Follow: " . $url . $link . "\n";

        }
    }
}

prints:

Follow: http://x.download.com/24x7/2012/11-November/

What should my regular expression be so that it matches only the hourly links?

This is done very simply with a regular expression, as shown in the program below. It looks for a string of digits or colons immediately following > (so it matches the text content of the element rather than the href attribute value, as yours does) and captures that string into $1.

But I would prefer to see the problem solved from start to finish using a proper HTML parser, such as HTML::TreeBuilder or Mojo::DOM.

use strict;
use warnings;

my @tag = <DATA>;

foreach (@tag) {
  next unless />([\d:]+)/;
  print "http://x.download.com/$1\n";
}

__DATA__
href="?Name">Name</a>
href="?Desc">Hourly Details</a>
href="/24x7/2012/11-November/">Data
href="./00:00:00/">00:00:00/</a>
href="./01:00:00/">01:00:00/</a>
href="./02:00:00/">02:00:00/</a>
href="./03:00:00/">03:00:00/</a>
href="./04:00:00/">04:00:00/</a>
href="./05:00:00/">05:00:00/</a>
href="./06:00:00/">06:00:00/</a>
href="./07:00:00/">07:00:00/</a>
href="./08:00:00/">08:00:00/</a>
href="./09:00:00/">09:00:00/</a>
href="./10:00:00/">10:00:00/</a>

Output:

http://x.download.com/00:00:00
http://x.download.com/01:00:00
http://x.download.com/02:00:00
http://x.download.com/03:00:00
http://x.download.com/04:00:00
http://x.download.com/05:00:00
http://x.download.com/06:00:00
http://x.download.com/07:00:00
http://x.download.com/08:00:00
http://x.download.com/09:00:00
http://x.download.com/10:00:00
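
For comparison, the parser-based approach mentioned above might look roughly like this with Mojo::DOM. This is only a sketch: the sample markup in $html is hypothetical (the real page source is not shown in the question), and $base is assumed from the URLs the question wants to build.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; the real page's HTML is not shown in the question.
my $html = <<'HTML';
<a href="?Name">Name</a>
<a href="/24x7/2012/11-November/">Data</a>
<a href="./00:00:00/">00:00:00/</a>
<a href="./01:00:00/">01:00:00/</a>
HTML

my $base = 'http://x.download.com/';   # assumed base URL, taken from the question
my @urls;

for my $link ( Mojo::DOM->new($html)->find('a[href]')->each ) {
    # Keep only links whose text looks like HH:MM:SS, with an optional trailing slash
    next unless $link->text =~ m{^(\d\d:\d\d:\d\d)/?$};
    push @urls, $base . $1;
}

print "$_\n" for @urls;
```

Because the parser hands back each element's text directly, the pattern only has to describe the time itself, not the surrounding markup.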

You do not want to do it with regular expressions. You need a proper HTML parser, and regexes cannot do the job.

How are you fetching the web page? If you're using WWW::Mechanize, then extracting the links from the page you have fetched is a single method call, because WWW::Mechanize does the HTML parsing for you.

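use feature 'say';   # needed for say()
use WWW::Mechanize;

# $url holds the address of the page to fetch, as in the question
my $mech = WWW::Mechanize->new();
$mech->get( $url );

my @links = $mech->links();
for my $link ( @links ) {
    say $link->text, ' -> ', $link->url; # Show the text and the URL
}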

You'll need to reformat the output as you see fit, but that gives you the idea.
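
To keep only the hourly links, one option is to let WWW::Mechanize filter on the link text with find_all_links and its text_regex criterion. The sketch below makes two assumptions: the page address in $url is a guess based on the question's output, and the link text is a time with an optional trailing slash.

```perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;

# Assumed address of the directory listing; substitute the real page.
my $url = 'http://x.download.com/24x7/2012/11-November/';

my $mech = WWW::Mechanize->new();
$mech->get($url);

# Keep only links whose text is HH:MM:SS, optionally followed by a slash
my @hourly = $mech->find_all_links( text_regex => qr{^\d\d:\d\d:\d\d/?$} );

for my $link (@hourly) {
    ( my $time = $link->text ) =~ s{/$}{};   # drop the trailing slash
    say "http://x.download.com/$time";
}
```

If the resolved link target is what you want rather than a rebuilt string, $link->url_abs returns the absolute URL of each link.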

First of all, we need to specify a regex that captures 24-hour (military) times down to the second.

my $regex 
    = qr{  # curly brackets instead of slashes
           # so that we can use literal slashes in expression

    "   # a quote
    \.  # a literal dot
    /   # a forward slash
    (   # begin capture group

       (?:              # begin uncaptured sub-group
           [01] \d      # a '0' or '1' followed by a digit
       |   2    [0-3]   # a '2' followed by 0-3
       )                # end grouping
       (?:         # begin repetition grouping
         :         # a literal colon               
         [0-5] \d  # digits 0-5 followed by any digit
       ){2}        # exactly twice
     )  # end capture

     /  # a forward slash
     "  # close quote
}x; # <- x-option allows annotated regex
...

This is equivalent to the following regex:

my $regex = qr/"\.\/((?:[01]\d|2[0-3])(?::[0-5]\d){2})\/"/;

If your minutes and seconds will only ever be '00:00', then the expression is even simpler:

my $regex = qr{"\./((?:[01]\d|2[0-3]):00:00)/"};

Then you can test for a match and retrieve the value in one step by making the match in list context:

if ( my ( $link ) = m/$regex/ ) { 
    say "http://x.download.com/$link";
}

If the match fails, the list assignment is empty, so the condition is false and $link stays undefined. If it succeeds, the match returns its captures because it is evaluated in list context (a list of one variable here), and the first capture is assigned to $link.
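
Putting the pieces together, a minimal self-contained sketch using only core Perl could look like this. The @lines data is copied from the question's output, apart from the "24:00:00" entry, which is a made-up out-of-range value added to show that it is rejected.

```perl
use strict;
use warnings;

# Hour 00-23, then exactly two ":MM" / ":SS" groups, wrapped in "./ ... /"
my $regex = qr{"\./((?:[01]\d|2[0-3])(?::[0-5]\d){2})/"};

my @lines = (
    'href="?Name">Name</a>',
    'href="/24x7/2012/11-November/">Data',
    'href="./00:00:00/">00:00:00/</a>',
    'href="./23:00:00/">23:00:00/</a>',
    'href="./24:00:00/">24:00:00/</a>',   # hypothetical invalid hour, should be skipped
);

my @urls;
for my $line (@lines) {
    # List-context match: $link receives the first capture, or the test is false
    if ( my ($link) = $line =~ $regex ) {
        push @urls, "http://x.download.com/$link";
    }
}

print "$_\n" for @urls;
```

Only the 00:00:00 and 23:00:00 entries pass the test; the other lines either lack the "./ prefix or carry an hour outside 00 to 23.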
