How do I extract data from an HTML table in Perl?

Question

I am trying to use regular expressions in Perl to parse a table with the following structure. The first row looks like this:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

Here, I want to pull out "Time Played", "Artist", "Title", and "Label" and print them to an output file.

I have tried many regular expressions, such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like this:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
    if ($lines =~ / <td>(.*)< / ) {
    print "this is the 2nd 1: \n";
    print $1;
    print "the word was: $1.\n";
    $Time = $1;
    print $Time;
    print OUTPUT_FILE $Time;
    } else {
    print "2ND IF FAILED\n";
    }
} else { 
print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);

Answer 1

Don't use regular expressions to parse HTML. There are plenty of CPAN modules that will do this for you far more effectively.

Answer 2

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

my ($table) = $te->tables;

print join("\t", @headers), "\n";

for my $row ($te->rows ) {
    print join("\t", @$row), "\n";
}

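This is what I meant by a "task-specific" HTML parser in another post.

You could have saved a lot of time by putting your energy into reading some documentation instead of throwing regexes at the wall and seeing if any stick.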

Answer 3

Here is a simple one:

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";

If you'd like to try running it, see http://codepad.org/qz9d5Bro.

Answer 4

my $html = '<tr><td>Time Played</td><td>Artist</td><td>Title</td><td>Label</td></tr>

<tr><td>Time Played 2</td><td>Artist 2</td><td>Title 2</td><td>Label 2</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";
print $stuff[5];

How do I get all the columns and put them into a report like the one below?
    
    Time played     Artist     Title      Label
    Time played 2   Artist 2   Title 2    Label 2
    
Also, shouldn't $stuff[4] be "Time played 2"?  How come $stuff[5] is "Time played 2"?
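
A possible explanation, offered as a sketch rather than a definitive fix: the blank line between </tr> and <tr> is itself text sitting between a ">" and a "<", so />([^<]+)</g captures it as $stuff[4] and pushes "Time Played 2" to $stuff[5]. The snippet below splits the input into rows first and drops whitespace-only captures before printing the report. The helper name table_rows and the 16-character column width are made up for illustration. Like any regex-based approach, it assumes simple, non-nested markup. For real pages, the parser modules recommended in the other answers are the safer route.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same two-row input as above, including the blank line between the rows.
my $html = '<tr><td>Time Played</td><td>Artist</td><td>Title</td><td>Label</td></tr>

<tr><td>Time Played 2</td><td>Artist 2</td><td>Title 2</td><td>Label 2</td></tr>';

# The text between </tr> and <tr> is two newlines, and it matches the
# pattern too, so it occupies index 4 and shifts the next cell to index 5.
my @stuff = $html =~ />([^<]+)</g;

# Hypothetical helper: capture each <tr>...</tr> body, then keep only the
# captures that contain at least one non-whitespace character.
sub table_rows {
    my ($markup) = @_;
    my @rows;
    for my $tr ($markup =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my @cells = grep { /\S/ } $tr =~ />([^<]+)</g;
        push @rows, \@cells if @cells;
    }
    return @rows;
}

# Print one report line per table row, left-aligned in fixed-width columns.
for my $row (table_rows($html)) {
    print join('', map { sprintf '%-16s', $_ } @$row), "\n";
}
```

Each table row comes out as one report line, and cell text longer than the chosen width would simply push the following columns to the right.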


Answer 1
17 2009-10-30 17:42:00

Answer 2
11 2009-10-30 19:43:13

Answer 3
0

Answer 4
-1 2022-07-17 04:35:54
