Highlighting glossary terms in an HTML document

Question

We have a glossary with up to 2000 terms, where each term may consist of one, two, or three words (separated by whitespace or a dash).

We are looking for a way to highlight all of these terms inside a longer HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the terms highlighted.

The constraints for a working solution are a large number of glossary terms and long HTML documents. What would be the blueprint for an efficient solution in Python?

Right now I am thinking about parsing the HTML document with lxml, iterating over all text nodes, and matching the contents of each text node against all glossary terms.

Client-side (browser) highlighting on the fly is not an option, since IE will complain about long-running scripts and hit a script timeout, which makes it unusable in production.

Any better ideas?

Answer 1

You could use a parser to walk the document tree recursively and replace only the nodes that consist of text.
In doing so, there are still several things you will need to account for:
- Not all text needs to be replaced (e.g. inline JavaScript).
- Some elements of the document might not need parsing (e.g. headings).

Here's a quick, non-production-ready example of how you could achieve this:

from bs4 import BeautifulSoup, NavigableString

html = """The HTML you need to parse"""

# Tags whose text content must be left untouched.
IGNORE_TAGS = ['script', 'style']

def parse_content(item, replace_what, replace_with, ignore_tags=IGNORE_TAGS):
    # Iterate over a copy, because replace_with() mutates item.contents.
    for content in list(item.contents):
        if isinstance(content, NavigableString):
            # Text node: substitute inside the text only.
            content.replace_with(content.replace(replace_what, replace_with))
        elif content.name not in ignore_tags:
            # Element node: recurse unless it is an ignored tag.
            parse_content(content, replace_what, replace_with, ignore_tags)
    return item

soup = BeautifulSoup(html, 'html.parser')
# Assumes the document contains <html> and <body> elements.
body = soup.html.body
replaced_content = parse_content(body, 'a', 'b')

This should replace every occurrence of "a" with "b", while leaving alone content that is:
- Inside inline JavaScript or CSS (although inline JS or CSS should not appear in a document's body).
- A reference inside a tag such as img, a...
- A tag itself.

Of course, depending on your glossary, you will then need to make sure that you don't replace only part of a word with something else. To do this, it makes sense to use a regex instead of content.replace.
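As a rough sketch of the regex idea, the snippet below compiles the whole glossary into one case-insensitive, whole-word pattern and applies it to text only. It swaps in the standard library's html.parser for BeautifulSoup so that it has no dependencies. The names build_pattern, Highlighter and highlight, and the choice of <b> as the highlight markup, are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


def build_pattern(terms):
    # Longest terms first, so "glossary term" wins over "glossary".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # \b on both sides prevents matching only part of a word.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


class Highlighter(HTMLParser):
    """Re-emits the markup and wraps glossary terms found in text."""

    SKIP = {"script", "style"}  # text inside these tags is left alone

    def __init__(self, pattern):
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(r"<b>\g<0></b>", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight(html, terms):
    parser = Highlighter(build_pattern(terms))
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(highlight("<p>A cat in a catalog.</p><script>var cat = 1;</script>", ["cat"]))
```

Because all terms go into a single compiled pattern, each text node is scanned once rather than once per term, which matters with around 2000 terms. Terms that are interrupted by an entity or an inline tag will not be matched by this sketch.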

Answer 2

I think highlighting with client-side JavaScript is the best option. It saves server processing time and bandwidth and, more importantly, keeps the HTML clean and usable for those who don't need the extra markup, for example when printing or converting to other formats.

To avoid timeouts, just split the job into chunks and process them one by one in a function that reschedules itself with setTimeout. Here's an example of this approach:

function hilite(terms, chunkSize) {

    // prepare stuff: escape regex metacharacters in each term, then
    // build one case-insensitive, whole-word pattern for all of them

    var escaped = $.map(terms, function(term) {
        return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    });
    var pattern = new RegExp("\\b(" + escaped.join("|") + ")\\b", "gi");

    // collect all text nodes in the document

    var textNodes = [];
    $("body").find("*").contents().each(function() {
        if (this.nodeType == 3)
            textNodes.push(this)
    });

    // process N text nodes at a time, surround terms with text "markers"

    function step() {
        for (var i = 0; i < chunkSize; i++) {
            if (!textNodes.length)
                return done();
            var node = textNodes.shift();
            node.nodeValue = node.nodeValue.replace(pattern, "\x1e$&\x1f");
        }
        setTimeout(step, 100);
    }

    // when done, replace "markers" with html

    function done() {
        $("body").html($("body").html().
            replace(/\x1e/g, "<b>").
            replace(/\x1f/g, "</b>")
        );
    }

    // let's go

    step()
}

Use it like this:

$(function() {
    hilite(["highlight", "these", "words"], 100)
})

Let me know if you have questions.

Answer 3

How about going through each term in the glossary and then, for each term, using a regex to find all occurrences in the HTML? You could replace each occurrence with the term wrapped in a span with the class "highlighted", which is styled to have a background color.
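A minimal sketch of this per-term loop might look like the following. The function name highlight_terms and the crude tag-splitting regex are assumptions for illustration; the split is only there so that the substitution touches text between tags and not attribute values or tag names.

```python
import re

# Crude tag matcher, assumed good enough for a sketch; it does not
# understand comments, scripts, or ">" inside attribute values.
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def highlight_terms(html, terms):
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        pieces = TAG_SPLIT.split(html)
        # With one capturing group, even indexes hold text and odd
        # indexes hold the tags themselves.
        for i in range(0, len(pieces), 2):
            pieces[i] = pattern.sub(
                r'<span class="highlighted">\g<0></span>', pieces[i]
            )
        html = "".join(pieces)
    return html


print(highlight_terms('<p title="cat">A cat sat.</p>', ["cat", "sat"]))
```

Note that this rescans the whole document once per term, so 2000 terms means 2000 passes, and a short term can match again inside text that an earlier, longer term has already wrapped, producing nested spans.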

3 answers

Answer 1
2 2011-12-03 11:41:54

Answer 2
0 2011-12-03 12:39:30

Answer 3
-1 2011-12-03 10:18:18
