Match multiple terms within <body> tags

I want to match any occurrence of a search term (or list of search terms) within the <body> tags of a document. My current solution uses preg (within a Joomla plugin):

$pattern = '/matchthisterm/i';
$article->text = preg_replace($pattern,"<span class=\"highlight\">\\0</span>",$article->text);

But this replaces every match anywhere in the document's HTML, including inside tags and attributes, so I need to match the <body> tags first. Is this even the best way to achieve this?

EDIT: OK, I've used simplehtmldom, but I need some help getting to the correct term. So far I've got:

$pattern = '/(matchthisterm)/i';
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
    //Match to the terms within the text nodes 
    if (preg_match($pattern, $term->plaintext)) {
        $term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
    }
}

This highlights the entire text node (it all shows up bold) rather than just the matched term. Is it OK to use preg_replace in here?

SOLUTION:

//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
    //Match to the terms within the text nodes
    $term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
}

No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:

<a href="/foo/matchthisterm/bar">bof</a>

gives quite thoroughly broken output:

<a href="/foo/<span class="highlight">matchthisterm</span>/bar">bof</a>

The proper way to do it would be to use a proper HTML/XML parser (for example DOMDocument::loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally, re-save the HTML back to a string.
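
As a rough sketch of that parse, scan, and re-save approach, the following uses PHP's built-in DOMDocument and DOMXPath. The function name highlightTerms and its parameters are hypothetical, not from the original answer. It assumes non-empty ASCII input and search terms, skips text inside script and style elements, and returns only the contents of the parsed body; character-encoding handling is left out.

```php
<?php
// Hypothetical helper (not from the original answer): highlight terms in text nodes only.
function highlightTerms(string $html, array $terms): string
{
    if ($html === '' || $terms === []) {
        return $html;                       // nothing to do
    }

    // One case-insensitive alternation; preg_quote() escapes regex metacharacters.
    $quoted  = array_map(fn($t) => preg_quote($t, '/'), $terms);
    $pattern = '/(' . implode('|', $quoted) . ')/i';

    libxml_use_internal_errors(true);       // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;                       // no body was produced, return the input unchanged
    }

    // Snapshot the text nodes first, because the tree is modified inside the loop.
    $xpath = new DOMXPath($doc);
    $query = './/text()[not(ancestor::script) and not(ancestor::style)]';
    $textNodes = iterator_to_array($xpath->query($query, $body));

    foreach ($textNodes as $node) {
        // With one capture group, odd indexes hold matches and even indexes hold the text between them.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                       // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Re-save only the children of <body> back to a string.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTerms('<div><a href="/foo/matchthisterm/bar">See MatchThisTerm here</a></div>', ['matchthisterm']);
```

Because only text nodes are rewritten, the href attribute is left alone, and the matched text keeps its original capitalisation.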

An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See, e.g., this question for an example.

I agree that processing HTML with regex is not a good solution.

I just read the argument about why regex can't parse HTML here: RegEx match open tags except XHTML self-contained tags

I quite agree with the whole thing, but the problem is MUCH simpler here: we just need to know whether we are inside some HTML tag or not. We don't have to parse an HTML structure, interpret a tree, or deal with mismatched tags and other errors. We just know that an HTML tag is something between < and >. I believe regex is a very good, well-suited and consistent tool here.

Just because we're dealing with some HTML doesn't mean we shouldn't use regex. We need to focus on the real problem here, which I believe doesn't really involve processing HTML. We only need to know whether we're inside a tag or not. I hope I won't get too many downvotes for this, but I fully stand by my position.

I'm redirecting you to a previous post (where you put a link to this topic) that I made earlier today: Highlight text, except html tags
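
One way to sketch the "are we inside a tag?" check is a negative lookahead that rejects a match when the next angle bracket after it is a closing >. This is not the code from the linked post. The function name highlightOutsideTags is hypothetical, and the sketch assumes that no literal < or > appears inside attribute values or text and that script, style, and comment contents do not need to be skipped.

```php
<?php
// Hypothetical helper: highlight a term only where it is not inside <...>.
function highlightOutsideTags(string $html, string $term): string
{
    // (?![^<>]*>) fails when a '>' comes before any other '<' or '>', i.e. when we are inside a tag.
    $pattern = '/' . preg_quote($term, '/') . '(?![^<>]*>)/i';
    return preg_replace($pattern, '<span class="highlight">$0</span>', $html);
}

echo highlightOutsideTags('<a href="/foo/matchthisterm/bar">matchthisterm</a>', 'matchthisterm');
// prints: <a href="/foo/matchthisterm/bar"><span class="highlight">matchthisterm</span></a>
```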

Along the same lines, and I hope we know all we need to, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string and deal with case insensitivity, don't use regex. Keep it simple. (I'm assuming you didn't simplify the replacement on purpose to explain your problem here.)
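
For a set of words, str_ireplace() accepts arrays for both the search and the replacement. A minimal sketch, with made-up terms and input, follows. Two caveats apply: the replacement is a fixed string, so the matched text takes the casing of the search term, not of the original text; and the replacements run one after another, so a later term could match inside markup inserted for an earlier one.

```php
<?php
// Illustrative terms and input; none of these values come from the original post.
$terms        = ['foo', 'bar'];
$replacements = array_map(fn($t) => '<span class="highlight">' . $t . '</span>', $terms);

$result = str_ireplace($terms, $replacements, 'Foo and BAR');
echo $result;
// prints: <span class="highlight">foo</span> and <span class="highlight">bar</span>
```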

I haven't used preg, but I've done pattern matching in Perl, Java and ActionScript before. If this is anything similar, you have to escape special characters, for example "\\<span class...". I found a website that talks about using preg; in case you haven't come across it, it can be found here.
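
On escaping in PHP's preg functions specifically: < is not a PCRE metacharacter, so the replacement markup itself should not need escaping. It is the search term that may contain metacharacters, and preg_quote() handles that. A small sketch with a made-up term:

```php
<?php
// Illustrative term containing regex metacharacters (+, parentheses).
$term    = 'c++ (beta)';
$pattern = '/' . preg_quote($term, '/') . '/i';

$result = preg_replace($pattern, '<span class="highlight">$0</span>', 'Try C++ (beta) today');
echo $result;
// prints: Try <span class="highlight">C++ (beta)</span> today
```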
