How can I extract only the text from an HTML table and ignore the tags?

Question

I have documents with HTML tables. Some of the cells contain only numbers. Other cells contain both numbers and words.

Is there any way to keep just the contents of the cells that have words and not keep the contents of cells with only numbers?

Is anyone aware of a module that I could use to do this? Alternatively, is there any way I could use a regular expression?

<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789</td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>

I am still pretty new to Perl, so please excuse my question if it is very simple. Also, I have already been warned about the potential problems of parsing HTML with a regular expression.

Thanks so much!

Eventually, I'll use a module to kill all of the HTML code, by the way.

Answer 1

As you already stated, HTML should not be parsed with regular expressions. A specialised parsing module like HTML::Parser can be of help:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

my $p = HTML::Parser->new( 'text_h' => [ \&text_handler, 'dtext' ] );
$p->parse_file(\*DATA);

sub text_handler {
    my $text = shift;
    $text =~ s/^\s+|\s+$//g;         # Trim leading and trailing whitespace
    return if $text eq '' || $text =~ /^[\d\s]+$/;   # Skip empty and number-only chunks

    print "$text\n";
}

__DATA__
<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789 558 </td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>

Output:

WORDS WORDS WORDS WORDS WORDS WORDS 123
WORDS WORDS
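
The filtering rule inside text_handler can also be tried on its own, without any parser involved. The following is a minimal sketch of that rule only; the helper name keep_cell_text and the @cells list are made up for illustration and are not part of HTML::Parser. It assumes the cell text has already been extracted by a parser.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): returns the trimmed text
# if it contains something other than digits and whitespace, otherwise
# returns nothing (undef in scalar context, empty list in list context).
sub keep_cell_text {
    my ($text) = @_;
    return unless defined $text;
    $text =~ s/^\s+|\s+$//g;              # trim both ends
    return if $text eq '';                # whitespace-only chunk
    return if $text =~ /^[\d\s]+$/;       # digits and whitespace only
    return $text;
}

# Cell contents copied from the sample table in the answer above.
my @cells = (
    'WORDS WORDS WORDS WORDS WORDS WORDS 123',
    ' 789 558 ',
    ' 123 ',
    'WORDS WORDS',
);

# In list context the helper returns an empty list for rejected cells,
# so map drops them directly.
my @kept = map { keep_cell_text($_) } @cells;
print "$_\n" for @kept;
```

Note that a cell mixing words and numbers, such as the first one, is kept whole, which matches the output shown above.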

Answer 2

There are several modules that you can use to do this; I'd go with HTML::TreeBuilder::XPath myself.

#!/usr/bin/env perl

use v5.12;      # also enables strict
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("data.html");

my @cells = $tree->findnodes('//td');
foreach my $cell (@cells) {
    if ($cell->as_text =~ /^[0-9 ]+$/) {
        $cell->delete_content;
    }
}
print $tree->as_HTML;

The XPath engine used is supposed to support an extension to XPath that allows regular expressions (which would allow us to eliminate the test in the loop above). My XPath chops aren't up to getting it working in the time I have available right now, though. This was my attempt:

#my @cells = $tree->findnodes( '//td[text() =~ /^[0-9 ]$/')->[0];
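
Since the question asks for the text rather than the markup, the same loop can collect each remaining cell's text instead of emptying cells and printing HTML. This is a rough sketch, assuming HTML::TreeBuilder::XPath is installed; the inline $html string is my own stand-in for data.html, and the @kept array is just an illustrative name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Inline copy of the question's table (an assumption for this sketch,
# standing in for data.html).
my $html = <<'HTML';
<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789</td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

my @kept;
foreach my $cell ($tree->findnodes('//td')) {
    my $text = $cell->as_text;            # text only, tags dropped
    $text =~ s/^\s+|\s+$//g;              # trim both ends
    next if $text eq '' || $text =~ /^[\d\s]+$/;   # skip number-only cells
    push @kept, $text;
}
print "$_\n" for @kept;

$tree->delete;   # free the parse tree
```

Because as_text works on the whole td element, a cell that contains inline markup should still come back as a single string rather than several separate text chunks.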
