I have documents containing HTML tables. Some cells contain only numbers; others contain both numbers and words.
Is there a way to keep the contents of the cells that contain words and discard the contents of the cells that contain only numbers?
Is there a module I could use to do this? Alternatively, is there any way I could use a regular expression?
<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789</td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>
I am still pretty new to Perl, so please excuse my question if it is very simple. Also, I have already been warned about the potential problems of parsing HTML with a regular expression.
Thanks so much!
Eventually, I'll use a module to kill all of the HTML code, by the way.
As you already stated, HTML should not be parsed with regular expressions. A specialised parsing module such as HTML::Parser can help:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

# Call text_handler for every chunk of text, with entities decoded ('dtext')
my $p = HTML::Parser->new( 'text_h' => [ \&text_handler, 'dtext' ] );
$p->parse_file(\*DATA);

sub text_handler {
    my $text = shift;
    $text =~ s/^\s+|\s+$//g;    # Trim leading and trailing whitespace
    # Skip empty chunks and chunks made up only of digits and whitespace
    return if $text eq '' || $text =~ /^[\d\s]+$/;
    print "$text\n";
}
__DATA__
<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789 558 </td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>
Output:
WORDS WORDS WORDS WORDS WORDS WORDS 123
WORDS WORDS
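One caveat with the handler above: it is called once per chunk of text, not once per cell, so a cell containing inline markup (for example a <b> tag) would be judged piece by piece. The following is only a sketch of one way around that, assuming it is acceptable to buffer text between each <td> start tag and its end tag. The names $in_cell, $buffer, @kept and the sample HTML are made up for illustration, and nested tables are not handled.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

my $in_cell = 0;     # true while the parser is inside a <td>
my $buffer  = '';    # text collected for the current cell
my @kept;            # cells containing more than digits and whitespace

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ \&start_handler, 'tagname' ],
    text_h  => [ sub { $buffer .= shift if $in_cell }, 'dtext' ],
    end_h   => [ \&end_handler, 'tagname' ],
);

sub start_handler {
    my $tag = shift;
    return unless $tag eq 'td';
    $in_cell = 1;
    $buffer  = '';    # start a fresh buffer for each cell
}

sub end_handler {
    my $tag = shift;
    return unless $tag eq 'td' && $in_cell;
    $in_cell = 0;
    ( my $text = $buffer ) =~ s/^\s+|\s+$//g;    # trim the whole cell
    # Keep the cell only if it has a character that is neither digit nor space
    push @kept, $text if $text =~ /[^\d\s]/;
}

# Hypothetical sample input; the first cell mixes plain text and inline markup
my $html = <<'HTML';
<table>
<tr><td>WORDS <b>WORDS</b> 123</td><td> 789</td></tr>
<tr><td> 123 </td><td>WORDS WORDS</td></tr>
</table>
HTML

$p->parse($html);
$p->eof;

print "$_\n" for @kept;
# prints:
# WORDS WORDS 123
# WORDS WORDS
```

Because the decision is made in the end handler, the cell is judged on its complete text, however the parser happened to split it.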
There are several modules that you can use to do this. I'd go with HTML::TreeBuilder::XPath myself.
#!/usr/bin/env perl
use v5.12;       # also enables strict
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("data.html");

# Empty every cell whose text consists only of digits and whitespace
my @cells = $tree->findnodes('//td');
foreach my $cell (@cells) {
    if ($cell->as_text =~ /^[0-9\s]+$/) {
        $cell->delete_content;
    }
}

print $tree->as_HTML;
The XPath engine used is supposed to support an extension to XPath that allows regular expressions, which would let us eliminate the test in the loop above. My XPath chops aren't up to getting it working in the time I have available right now, though.
# Untested attempt:
# my @cells = $tree->findnodes('//td[text() =~ /^[0-9 ]+$/]');