How do I match text in HTML that's not inside a tag?

Question

Given a string like this:

<a href="http://blah.com/foo/blah">This is the foo link</a>

... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:

<a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>

However, a simple search-and-replace won't work, because it will also match part of the URL in the <a> tag's href.

So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?

Note: I promise that the HTML in question will never be anything pathological like:

<img title="Haha! Here are some angle brackets to screw you up: ><" />

Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.

Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."

Now multiply the above across a great many users who are wondering why the simple script they've been using isn't so simple anymore, when the only change was to make the search term boldface.

So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

Answer 1

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:

s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
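
For anyone who wants to try the substitution above in isolation, here is a minimal sketch. The helper name highlight and the quotemeta step are assumptions added for illustration (they are not part of the answer); quotemeta simply makes the search term match literally. The same caveat applies: angle brackets may appear only as tag delimiters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: wrap each occurrence of $term that sits outside a tag
# in <b>...</b>. Assumes '<' and '>' occur only as tag delimiters.
sub highlight {
    my ($html, $term) = @_;
    my $key = quotemeta $term;    # assumption: the term is a literal string, not a pattern

    # A match must begin either right after a '>' (the end of a tag) or at \G
    # (the start of the string, or where the previous match ended). [^<]*? can
    # never cross a '<', so the match cannot run into the next tag.
    $html =~ s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g;
    return $html;
}

print highlight('<a href="http://blah.com/foo/blah">This is the foo link</a>', 'foo'), "\n";
# prints: <a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>
```

The \G alternative is what lets a second or third occurrence inside the same run of text be picked up, since those occurrences are not directly preceded by a tag.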

Answer 2

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

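use Template::Refine::Fragment;
# Assumption: simple_replace is exported by Template::Refine::Utils.
use Template::Refine::Utils qw(simple_replace);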

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world.  <a href="http://foo.com/">This is a test of foo finding.</a>  Here is another foo.');

say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;

This outputs:

<p>Hello, world.  <a href="http://foo.com/">This is a test of &lt;foo&gt; finding.</a>  Here is another &lt;foo&gt;.</p>

Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".

Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use Template::Refine, however, via the new_from_dom constructor.)

Answer 3

The following regex will match all text between tags or outside of tags:

<.*?>(.*?)<.*?>|>(.*?)<

Then you can operate on that as desired.
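
As one possible way to "operate on that", here is a rough sketch that swaps in a different technique from the regex above: it uses split with a capturing pattern to separate tags from text, then substitutes only in the text pieces. The helper name highlight_text_nodes is hypothetical, and the sketch assumes that angle brackets appear only as tag delimiters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split the markup into alternating text and tag pieces,
# then highlight $term in the text pieces only.
sub highlight_text_nodes {
    my ($html, $term) = @_;
    my $key = quotemeta $term;    # match the term literally

    # Because the pattern is wrapped in a capturing group, split returns the
    # tags themselves as well as the text between them.
    my @parts = split /(<[^>]*>)/, $html;

    for my $part (@parts) {
        next if $part =~ /^</;             # a tag: leave it untouched
        $part =~ s/($key)/<b>$1<\/b>/g;    # text: wrap each occurrence
    }
    return join '', @parts;
}

print highlight_text_nodes('<a href="http://blah.com/foo/blah">This is the foo link</a>', 'foo'), "\n";
# prints: <a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>
```

This is longer than a one-liner, but it still needs nothing beyond core Perl, and the tag-versus-text decision is explicit rather than encoded in the pattern.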

Answer 4

Try this one:

(?=>)?(\w[^>]+?)(?=<)

It matches all words between tags.

Answer 5

To pull the variable-length text content out of tags, even nested ones, you can use this regex, which is in effect a mini regular grammar for the job. (Note: it needs a PCRE-style engine, because it uses the subpattern call (?1).)

(?<=>)((?:\w+)(?:\s*))(?1)*
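
To see what this pattern actually picks up, here is a small sketch in Perl, which has supported the (?1) subpattern call since 5.10. The pattern is written with single backslashes, as a regex literal needs, and the helper name text_runs is hypothetical. Note that \w covers only word characters, so a run ends at the first punctuation mark as well as at the next tag.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return every run of whitespace-separated words that
# starts immediately after a '>'.
sub text_runs {
    my ($html) = @_;
    my @runs;
    # (?1) re-runs the sub-pattern of group 1 (one word plus trailing spaces),
    # so the whole match is "word, spaces, word, spaces, ...".
    while ($html =~ /(?<=>)((?:\w+)(?:\s*))(?1)*/g) {
        push @runs, $&;    # $& is the text of the whole match
    }
    return @runs;
}

print "$_\n" for text_runs('<a href="http://blah.com/foo/blah">This is the foo link</a>');
# prints: This is the foo link
```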

5 answers

Answer 1: score 10, accepted, 2009-02-22 04:26:05
Answer 2: score 7, 2009-02-22 04:15:06
Answer 3: score 2, 2009-02-22 04:29:38
Answer 4: score 0, 2012-06-20 09:44:19
Answer 5: score 0, 2014-05-27 07:36:02
