How can I extract data from HTML tables in Perl?
I am trying to use regular expressions in Perl to parse a table with the following structure. The first row looks like this:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
From this row I want to extract "Time Played", "Artist", "Title", and "Label" and print them to an output file.
I have tried many regular expressions, such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like this:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
    print "this is 1: \n";
    print $1;
    if ($lines =~ / <td>(.*)< / ) {
        print "this is the 2nd 1: \n";
        print $1;
        print "the word was: $1.\n";
        $Time = $1;
        print $Time;
        print OUTPUT_FILE $Time;
    } else {
        print "2ND IF FAILED\n";
    }
} else {
    print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);
Don't use regular expressions to parse HTML. There are plenty of CPAN modules that will do this for you far more effectively.
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}
my @headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", @headers), "\n";
for my $row ($te->rows ) {
    print join("\t", @$row), "\n";
}
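This is what I meant by a "task-specific" HTML parser in another post.
You can save a lot of time by putting your energy into reading some documentation rather than throwing regexes at the wall to see whether any of them stick.
Here is a simple one: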
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";
If you want to try running it, see http://codepad.org/qz9d5Bro.
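For comparison, here is a minimal sketch of why the patterns in the question do not behave like the one above. Without the /x modifier, the spaces inside / <td>(.*)< / are literal characters, and even with them removed, the greedy .* runs to the last "<" in the row. The variable names ($row, $spaced, $greedy, @cells) are introduced here for illustration only, and the regex approach still assumes simple, flat markup like the sample row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $row = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';

# 1. Spaces in a pattern are literal unless /x is used. This pattern needs
#    the text " <td>", which never occurs in the row, so it does not match.
my $spaced = $row =~ / <td>(.*)< / ? 1 : 0;

# 2. With the spaces removed, the greedy .* extends to the LAST "<" in the
#    row, so the capture swallows nearly the whole line.
my ($greedy) = $row =~ /<td>(.*)</;

# 3. [^<]+ cannot cross into the next tag, and /g in list context returns
#    every capture, giving one element per non-empty cell.
my @cells = $row =~ />([^<]+)</g;

print "spaced pattern matched: $spaced\n";
print "greedy capture: $greedy\n";
print "cells: ", join(', ', @cells), "\n";   # Time Played, Artist, Title, Label
```

The character class [^<] is what keeps each capture inside a single cell; a non-greedy .*? would also stop early, but it would still match empty cells and text containing tags.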
my $html = '<tr><td>Time Played</td><td>Artist</td><td>Title</td><td>Label</td></tr>
<tr><td>Time Played 2</td><td>Artist 2</td><td>Title 2</td><td>Label 2</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";
print $stuff[5];
How can I get all the columns and put them in a report like the one below?
Time played Artist Title Label
Time played 2 Artist 2 Title 2 Label 2
Also, shouldn't $stuff[4] be "Time played 2"? Why is $stuff[5] "Time played 2" instead?
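On the index question: the newline between the first </tr> and the second <tr> sits between a ">" and a "<", so />([^<]+)</ captures it as a fifth element. That newline is $stuff[4], which pushes "Time Played 2" to $stuff[5]. What follows is a minimal sketch of one way to build the report, assuming every row is a flat <tr>...</tr> with no nested tables; @report is a name introduced here for illustration, and this remains a regex shortcut rather than a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<tr><td>Time Played</td><td>Artist</td><td>Title</td><td>Label</td></tr>
<tr><td>Time Played 2</td><td>Artist 2</td><td>Title 2</td><td>Label 2</td></tr>';

# The flat capture picks up the newline between the two rows as element 4.
my @stuff = $html =~ />([^<]+)</g;
print "stuff[4] contains only whitespace\n" if $stuff[4] !~ /\S/;

# Capture each <tr>...</tr> block first so the cells stay grouped by row.
# /s lets . match a newline in case a row spans several lines.
my @report;
for my $row ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gs) {
    # Same cell capture as before, minus whitespace-only matches.
    my @cells = grep { /\S/ } $row =~ />([^<]+)</g;
    push @report, join(' ', @cells);
}

print "$_\n" for @report;
```

Grouping by row first also means the report does not depend on every row having the same number of cells.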