How do I strip HTML from a string using Perl?

Question

Is there any easier way than this to strip HTML from a string using Perl?

$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;

I would appreciate a slimmed-down regular expression, e.g. something like this:

$Error_Msg =~ s|</?[b|h1|br]>||ig;

Also, is there an existing Perl function that strips any/all HTML from a string, even though I only need bold tags, h1 headers and br tags stripped?

Answer 1

Assuming the code is valid HTML (no stray < or > operators):

$htmlCode =~ s|<.+?>||g;

If you need to remove only the b, h1 and br tags:

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;

You might also want to consider the HTML::Strip module.
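As a minimal sketch of the selective substitution above, the pattern can be wrapped in a small helper. The function name strip_selected_tags is hypothetical, and two details differ from the answer's pattern and are assumptions on my part: the /i flag (so that <H1> and <BR> are also matched, as in the question's own code) and [^>]* in place of .*? (so that a tag's attributes may span line breaks).

```perl
use strict;
use warnings;

# Hypothetical helper: removes only <b>, <h1> and <br> tags (opening, closing,
# self-closing, any letter case, with or without attributes) and keeps the text.
sub strip_selected_tags {
    my ($text) = @_;
    # \b stops "b" from matching the start of longer tag names such as <body>;
    # [^>]* consumes any attributes up to the closing angle bracket.
    $text =~ s{</?(?:b|h1|br)\b[^>]*>}{}gi;
    return $text;
}

print strip_selected_tags('<H1>Error</H1><b>Disk full</b><br />Retry later'), "\n";
```

Like any regex-based approach, this still assumes that no attribute value contains a literal > character.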

Answer 2

From perlfaq9: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText, which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
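The FAQ does not show code for the HTML::Parser route, so the following is only a rough sketch of one way it might look, assuming the HTML::Parser distribution from CPAN is installed. The helper name html_to_text is hypothetical. It registers a handler for text events only, asks for decoded text (the 'dtext' argument spec, which converts entities), and skips the contents of script and style elements.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collects only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each run of text; 'dtext' passes the text with
        # entities such as &lt; already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Do not report the contents of <script> and <style> elements as text.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # flush any buffered text

    return $text;
}

print html_to_text('<h1>Error</h1><p>a &lt; b</p>'), "\n";
```

Because no handlers are registered for tags or comments, those events are simply dropped, which is what removes the markup.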

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.
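To illustrate the line-break problem mentioned above, here is a small sketch that uses only core Perl. The sample string is made up for the demonstration. Without the /s modifier, "." does not match a newline, so a tag that spans two lines is never matched.

```perl
use strict;
use warnings;

# A made-up anchor tag whose attributes continue on a second line.
my $html = qq{<a\n   href="page.html">link</a>};

# Without /s, "." stops at the newline, so the opening tag is left in place.
(my $naive = $html) =~ s/<.*?>//g;

# With /s, "." also matches newlines and the multi-line tag is removed.
(my $with_s = $html) =~ s/<.*?>//gs;

print 'without /s: [' . $naive . "]\n";
print 'with /s:    [' . $with_s . "]\n";
```

Adding /s only addresses line breaks; quoted angle brackets and comments still defeat this pattern.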

Here's one "simple-minded" approach that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
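As a quick way to see how the one-line substitution from the FAQ behaves on a few of these cases, the sketch below wraps it in a helper and prints each result. The helper name faq_strip and the choice of cases are mine; the pattern itself is copied from the one-liner above.

```perl
use strict;
use warnings;

# The perlfaq9 one-liner wrapped in a (hypothetical) helper function.
sub faq_strip {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my @cases = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',    # quoted angle bracket
    '<!-- <A comment> -->',                   # comment containing a tag
    '<script>if (a<b && a>c)</script>',       # operators inside a script
);

for my $case (@cases) {
    printf "%-40s => [%s]\n", $case, faq_strip($case);
}
```

The quoted-attribute case is removed cleanly, but the comment case leaves a stray " -->" behind and the script body is mangled, which is why the FAQ points to a real parser for anything beyond simple input.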

Answer 3

You should definitely have a look at HTML::Restrict, which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'

I would recommend staying away from HTML::Strip because it breaks UTF-8 encoding.

3 answers

Answer 1: 21, accepted, 2009-07-01 05:31:04
Answer 2: 14, 2009-07-01 08:16:54
Answer 3: 14, 2011-03-03 13:09:35
