How do I extract only the text from an HTML table and ignore the markup?

Question

I have documents that contain HTML tables. Some of the cells contain only numbers. Other cells contain both numbers and words.

Is there any way to keep the contents of the cells that contain words and discard the contents of the cells that contain only numbers?

Is there a module anyone is aware of that I could use to do this? Alternatively, is there any way I could use a regular expression?

<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789</td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>

I am still pretty new to Perl, so please excuse my question if it is very simple. Also, I have already been warned about the potential problems of parsing HTML with a regular expression.

Thanks so much!

By the way, I'll eventually use a module to strip out all of the HTML.

Answer 1

As you already stated, HTML should not be parsed with regular expressions. A specialised parsing module such as HTML::Parser can help:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

my $p = HTML::Parser->new( 'text_h' => [ \&text_handler, 'dtext' ] );
$p->parse_file(\*DATA);

sub text_handler {
    my $text = shift;
    $text =~ s/^\s+|\s+$//g;         # Trim leading and trailing whitespace
    return if $text eq '' || $text =~ /^[\d\s]+$/;   # Skip empty and number-only text

    print "$text\n";
}

__DATA__
<table>
<tr>
<td>WORDS WORDS WORDS WORDS WORDS WORDS 123</td>
<td> 789 558 </td>
</tr>
<tr>
<td> 123 </td>
<td>WORDS WORDS</td>
</tr>
</table>

Output:

WORDS WORDS WORDS WORDS WORDS WORDS 123
WORDS WORDS
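
The filtering rule in text_handler can also be tried on its own, without any parser, to check which cell texts it keeps. The following is only a minimal sketch using core Perl: keep_cell_text is a hypothetical helper name that does not appear in the answer above, and the sample strings are simply the cell contents from the example table.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): returns the trimmed text
# if it contains anything besides digits and whitespace, otherwise nothing.
sub keep_cell_text {
    my ($text) = @_;
    return unless defined $text;
    $text =~ s/^\s+|\s+$//g;              # trim leading and trailing whitespace
    return if $text eq '';                # nothing left after trimming
    return if $text =~ /^[\d\s]+$/;       # only digits and whitespace
    return $text;
}

# Cell contents taken from the example table
my @cells = (
    'WORDS WORDS WORDS WORDS WORDS WORDS 123',
    ' 789 558 ',
    ' 123 ',
    'WORDS WORDS',
);

# In list context a bare return yields an empty list, so rejected cells
# simply drop out of the map result.
my @kept = map { keep_cell_text($_) } @cells;
print "$_\n" for @kept;
```

A cell such as "WORDS ... 123" is kept because the pattern is anchored at both ends: it rejects text only when every character is a digit or whitespace.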

Answer 2

There are several modules that you can use to do this; I'd go with HTML::TreeBuilder::XPath myself.

#!/usr/bin/env perl

use v5.12;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("data.html");

my @cells = $tree->findnodes('//td');
foreach my $cell (@cells) {
    if ($cell->as_text =~ /^[0-9 ]+$/) {
        $cell->delete_content;
    }
}
print $tree->as_HTML;

The XPath engine used is supposed to support an extension to XPath that allows regular expressions (which would let us eliminate the test in the loop above). My XPath chops aren't up to getting it working in the time I have available right now, though.

#my @cells = $tree->findnodes('//td[text() =~ /^[0-9 ]+$/]');

Answer 1
2, accepted, 2012-08-17 06:16:36

Answer 2
2, 2012-08-17 06:27:55
