Perl regular expression for HTML

Question

I need to extract the IMDB id (for example, tt0416449 for the movie 300) of a movie specified by the variable URL. I have looked at the page source for this page and come up with the following regex:

use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";

if (is_success( $content = LWP::Simple::get($url) ) ) {
    print "$url is alive!\n";
} else {
    print "No movies found";
}

$code = "";

if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
    $code = $1;
}

I am getting an internal server error at this line:

$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s

I am very new to Perl, and would be grateful if anyone could point out my mistake(s).

Answer 1

Use an HTML parser. Regular expressions cannot parse HTML.

Anyway, the reason for the error is probably that you forgot to escape a forward slash in your regex (the one in </td>). It should look like this:

/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
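Note also that the pattern has no capture group, so `$1` is never set by it. A minimal sketch of one way to handle both issues, assuming markup shaped like the pattern in the question (the `$content` string below is a made-up sample, not real IMDB output): using `m{...}` as the delimiter removes the need to escape slashes, and the parentheses capture the id.

```perl
use strict;
use warnings;

# Hypothetical sample input, shaped like the markup the question's regex expects.
my $content = '<td class="number">1.</td><td class="image">'
            . '<a href="/title/tt0416449/" title="300 (2006)">';

my $code = '';

# m{...} avoids escaping "/"; the parentheses capture the id into $1.
if ($content =~ m{<td class="number">1\.</td><td class="image"><a href="/title/(tt\d{1,7})}) {
    $code = $1;
}

print "$code\n";
```

This still depends on the exact markup and will break when the page layout changes, which is why a parser is the safer choice.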

Answer 2

A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.

Long version

The combination of its UserAgent, DOM and URL classes can work in a very robust way:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;

# preparations
my $ua  = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";

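# try to load the page; result() dies on connection errors
my $res = $ua->get($url)->result;

# error handling for HTTP-level failures
die 'Request failed: ' . $res->code . ' ' . $res->message . "\n"
    unless $res->is_success;

# extract the url of the first link that points to a title page
my $movie_link = $res->dom->at('a[href^="/title"]')
    or die "No movie link found\n";
my $movie_url  = Mojo::URL->new($movie_link->attr('href'));
say $movie_url->path->parts->[-1];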

Output:

tt0381061

Short version

The fun one-liner helper module ojo helps to build a very short version:

$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attr("href") =~ m|([^/]+)/?$|'

Output:

tt0381061

Answer 3

I agree that XML is hostile to line-oriented editing, and thus anti-Unix, but there is AWK.

If awk can do it, Perl surely can. I can produce a list:

curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '
$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq

Pass the search string in as "ID". Basically, it is all about how you choose your tokenizer in awk; I use the <a> tag. It should be easier in Perl.
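For comparison, here is a rough Perl sketch of the same tokenizing idea, using only core Perl. It is an illustration rather than a tested scraper: the `$html` string is a made-up sample (real IMDB markup may differ), and `$search` is a hypothetical variable that plays the role of `ID` in the awk command.

```perl
use strict;
use warnings;

# Hypothetical sample input; real IMDB markup may differ.
my $html = '<td><a href="/title/tt0416449/">300</a> (2006)</td>'
         . '<td><a href="/name/nm0000300/">Somebody</a></td>'
         . '<td><a href="/title/tt1253863/">300: Rise of an Empire</a></td>';

my $search = '300';   # plays the role of ID in the awk command
my @found;

# Split into records on <a or </a>, like awk's RS='<a|</a>'.
for my $record (split /<a|<\/a>/, $html) {
    # Split each record into fields on > or ", like awk's FS='>|"'.
    my @fields = split /[>"]/, $record;
    next unless @fields >= 2;
    my ($href, $text) = @fields[1, -1];

    # Keep title links whose link text contains the search string.
    next unless $record =~ /title/ && $text =~ /\Q$search\E/;
    push @found, "$text\t$1" if $href =~ m{/(tt[0-9]+)/};
}

print "$_\n" for @found;
```

Like the awk version, this treats HTML as plain text, so it only holds up while the links keep this simple shape.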
