
Highlighting glossary terms inside an HTML document

We have a glossary with up to 2000 terms, where each glossary term may consist of one, two or three words (separated by either whitespace or a dash).

Now we are looking for a solution for highlighting all terms inside a (longer) HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the highlighted terms.

The constraints for a working solution are a large number of glossary terms and long HTML documents. What would be the blueprint for an efficient solution in Python?

Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.

Client-side (browser) highlighting on the fly is not an option, since IE will complain about long-running scripts and hit its script timeout, which makes it unusable in production.

Any better idea?

You could use a parser to walk the document tree recursively and modify only the nodes that are plain text.
In doing so, there are still several things you will need to account for:
- Not all text should be replaced (e.g. inline JavaScript)
- Some elements of the document might not need processing at all (e.g. headings)

Here's a quick, non-production-ready example of how you could achieve this:

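# Uses BeautifulSoup 4 (pip install beautifulsoup4); the old BeautifulSoup 3 module is Python 2 only
from bs4 import BeautifulSoup, NavigableString

html = """<html><body><p>The HTML you need to parse</p></body></html>"""

IGNORE_TAGS = ['script', 'style']

def parse_content(item, replace_what, replace_with, ignore_tags=IGNORE_TAGS):
    # Iterate over a copy, because replace_with() modifies item.contents
    for content in list(item.contents):
        if isinstance(content, NavigableString):
            # Text node: substitute within the text only
            content.replace_with(content.replace(replace_what, replace_with))
        elif content.name not in ignore_tags:
            # Element node: recurse unless it is in the ignore list
            parse_content(content, replace_what, replace_with, ignore_tags)
    return item

soup = BeautifulSoup(html, 'html.parser')
body = soup.html.body
replaced_content = parse_content(body, 'a', 'b')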

This should replace any occurrence of an "a" with a "b", while leaving alone content that is:
- Inside inline JavaScript or CSS (although inline JS or CSS should not appear in a document's body).
- An attribute value in a tag such as img, a...
- A tag itself

Of course, depending on your glossary, you will then need to make sure that you don't replace only part of a word with something else; to do this, it makes sense to use a regex instead of content.replace.
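To illustrate the regex idea, here is a minimal sketch that compiles the whole glossary into a single pattern, so each text node is scanned once instead of once per term. The names build_glossary_pattern and highlight_text and the span class "glossary" are made up for this example, and it assumes you pass in the unescaped text of a single text node; how you put the returned markup back into the tree depends on your parser.

```python
import html
import re

def build_glossary_pattern(terms):
    """Compile all glossary terms into one case-insensitive regex."""
    # Longest terms first, so "text node" wins over "text" at the same position.
    ordered = sorted({t.strip() for t in terms if t.strip()}, key=len, reverse=True)
    parts = []
    for term in ordered:
        # Accept a whitespace run or a dash between the words of a term.
        words = re.split(r"[\s-]+", term)
        parts.append(r"(?:\s+|-)".join(re.escape(w) for w in words))
    # The lookarounds stop a term from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.IGNORECASE)

def highlight_text(text, pattern):
    """Return HTML markup for one text node, with glossary matches wrapped in a span."""
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(html.escape(text[pos:m.start()], quote=False))
        out.append('<span class="glossary">%s</span>' % html.escape(m.group(0), quote=False))
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)

pattern = build_glossary_pattern(["text node", "text", "parser"])
print(highlight_text("A parser yields each text node; context is untouched.", pattern))
```

Sorting the terms longest-first matters because regex alternation takes the first alternative that matches, not the longest one.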

I think highlighting with client-side JavaScript is the best option. It saves your server processing time and bandwidth and, more importantly, keeps the HTML clean and usable for those who don't need the extra markup, for example when printing or converting to other formats.

To avoid timeouts, just split the job into chunks and process them one at a time, scheduling each next chunk with setTimeout so the browser gets control back in between. Here's an example of this approach:

function hilite(terms, chunkSize) {

    // build one case-insensitive pattern, escaping regex metacharacters in the terms

    var escaped = $.map(terms, function(term) {
        return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    });
    var pattern = new RegExp("\\b(" + escaped.join("|") + ")\\b", "gi");

    // collect all text nodes in the document, skipping script and style content

    var textNodes = [];
    $("body, body *").not("script, style").contents().each(function() {
        if (this.nodeType == 3)
            textNodes.push(this)
    });

    // process N text nodes at a time, surround terms with text "markers"

    function step() {
        for (var i = 0; i < chunkSize; i++) {
            if (!textNodes.length)
                return done();
            var node = textNodes.shift();
            node.nodeValue = node.nodeValue.replace(pattern, "\x1e$&\x1f");
        }
        setTimeout(step, 100);
    }

    // when done, replace "markers" with html

    function done() {
        $("body").html($("body").html().
            replace(/\x1e/g, "<b>").
            replace(/\x1f/g, "</b>")
        );
    }

    // let's go

    step()
}

Use it like this:

$(function() {
    hilite(["highlight", "these", "words"], 100)
})

Let me know if you have questions.

How about going through each term in the glossary and then, for each term, using regex to find all occurrences in the HTML? You could replace each of those occurrences with the term wrapped in a span with a class "highlighted" that will be styled to have a background color.
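A word of caution on that idea: a regex run over the raw markup can also match inside tag names and attribute values, and one pass per term means up to 2000 passes over the document. As a rough sketch of a variation (not a tested production solution), the code below uses only the standard library's html.parser to re-emit the document in a single pass and wraps matches only in text outside script and style elements. The class name Highlighter is made up for this example, and the span class "highlighted" is taken from the suggestion above.

```python
import re
from html.parser import HTMLParser

class Highlighter(HTMLParser):
    """Re-emit the HTML it is fed, wrapping glossary terms found in text."""
    SKIP = {"script", "style"}  # text inside these elements is left alone

    def __init__(self, terms):
        # convert_charrefs=False keeps entities such as &amp; exactly as written
        super().__init__(convert_charrefs=False)
        ordered = sorted(terms, key=len, reverse=True)  # prefer longer terms
        alternation = "|".join(re.escape(t) for t in ordered)
        self.pattern = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternation, re.IGNORECASE)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text, untouched
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            data = self.pattern.sub(r'<span class="highlighted">\g<0></span>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

highlighter = Highlighter(["text node", "parser"])
highlighter.feed('<p title="parser">A Parser reads a text node.</p>'
                 '<script>var parser = 1;</script>')
highlighter.close()
result = "".join(highlighter.out)
print(result)
```

In this sketch the title attribute and the script body pass through unchanged, and a term that is interrupted by an entity or an inline tag is not matched.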


 