Perl regular expression for HTML

Question

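I need to extract the IMDB id (for example, tt0416449 for the movie 300) of the movie specified by the variable URL. I looked at the source of the search results page and came up with the following regular expression: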

use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";

if (is_success( $content = LWP::Simple::get($url) ) ) {
    print "$url is alive!\n";
} else {
    print "No movies found";
}

$code = "";

if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
    $code = $1;
}

I get an internal server error on this line:

$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s

I am new to Perl, so I would greatly appreciate it if someone could point out my mistake.

Answer 1

Use an HTML parser. Regular expressions cannot parse HTML.
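As a rough illustration of the parser route, here is a minimal sketch that assumes the CPAN module HTML::TreeBuilder is installed (it is not part of core Perl). The sample markup is made up to resemble the fragment the question is matching; the real page may be laid out differently.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, assumed to be installed

# Hypothetical sample markup standing in for the downloaded search page.
my $content = <<'HTML';
<table>
<tr><td class="number">1.</td>
<td class="image"><a href="/title/tt0416449/"><img src="poster.jpg"></a></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($content);

# First <a> element whose href starts with /title/tt followed by digits.
my $link = $tree->look_down(_tag => 'a', href => qr{^/title/tt\d+});

my $code = '';
if ($link && $link->attr('href') =~ m{^/title/(tt\d+)}) {
    $code = $1;    # the id without the surrounding path
}
$tree->delete;     # release the parse tree

print "$code\n";
```

With the sample markup above this prints tt0416449. The parser copes with attribute order, whitespace and line breaks, which is where a hand-written pattern tends to break.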

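In any case, the error is probably caused by an unescaped forward slash in your regex: the slash in </td> ends the pattern early, and the rest is then parsed as invalid code. It should look like this: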

/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
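Note that the pattern has no capture group, so $1 would not hold the id even once the slash is escaped. The sketch below is one possible way to write it, not necessarily what the asker intended: it uses m{...} delimiters so that no slash needs escaping, and parentheses around the id. The sample string is a made-up fragment in the shape the question's pattern expects, and the optional trailing slash in the href is an assumption.

```perl
use strict;
use warnings;

# Hypothetical fragment in the shape the question's pattern expects.
my $content = '<td class="number">1.</td><td class="image"><a href="/title/tt0416449/">';

my $code = '';

# m{...} avoids escaping every "/"; the parentheses capture the id into $1.
if ($content =~ m{<td class="number">1\.</td><td class="image"><a href="/title/(tt\d{1,7})/?"}s) {
    $code = $1;
}

print "$code\n";
```

For the sample string this prints tt0416449. Running perl -c on the script is a quick way to catch this kind of syntax error before it turns into an internal server error.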

Answer 2

Some of the tools in the Mojolicious distribution provide a very nice interface for this kind of work.

Long version

The combination of its UserAgent, DOM and URL classes works very reliably:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;

# preparations
my $ua  = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";

# try to load the page
my $tx = $ua->get($url);

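# error handling (in newer Mojolicious releases, error returns a hash reference
# and the success method is no longer available)
if (my $err = $tx->error) {
    die $err->{code}
        ? "$err->{code} response: $err->{message}"
        : "Connection error: $err->{message}";
}

# extract the url (attr is the accessor in newer Mojo::DOM; older releases used attrs)
my $movie_link  = $tx->res->dom('a[href^=/title]')->first;
my $movie_url   = Mojo::URL->new($movie_link->attr('href'));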
say $movie_url->path->parts->[-1];

Output:

tt0381061

Short version

The fun one-liner helper module ojo makes a very short version possible:

$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attr("href") =~ m|([^/]+)/?$|'

Output:

tt0381061

Answer 3

I agree that XML is hostile to line-oriented editing, and therefore anti-Unix, but there is AWK.

If awk can do it, perl certainly can. I can produce a list:

curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '

$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq

Pass the search string in as "ID". Basically it all comes down to how you choose the tokenizer in awk; I split on the <a> tags. It should be easier in perl.
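A rough Perl counterpart of the awk script might look like the sketch below. It is only an approximation: it runs on a made-up sample instead of the curl output, the second title and the name link are invented for illustration, and the %seen hash drops every repeated id whereas uniq only drops adjacent duplicate lines.

```perl
use strict;
use warnings;

# Hypothetical sample of search-result markup; the real page may differ.
my $html = <<'HTML';
<td><a href="/title/tt0416449/">300</a> (2006)</td>
<td><a href="/name/nm0000001/">Someone</a></td>
<td><a href="/title/tt0000001/">Another 300 title</a></td>
<td><a href="/title/tt0416449/">300</a> (2006)</td>
HTML

my $search = '300';    # plays the role of ID in the awk script
my %seen;              # ids already reported
my @rows;

# Visit each <a href="/title/tt...">text</a> pair in document order.
while ($html =~ m{<a\s[^>]*href="/title/(tt\d+)[^"]*"[^>]*>([^<]*)</a>}g) {
    my ($id, $text) = ($1, $2);
    next unless $text =~ /\Q$search\E/;    # link text must contain the search string
    next if $seen{$id}++;                  # skip ids that were already listed
    push @rows, "$text\t$id";
}

print "$_\n" for @rows;
```

On the sample this lists two rows, the link text followed by a tab and the id. As with the awk version, it only tokenizes on anchor tags and would need adjusting if the markup nests other tags inside the link.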
