
Replacing Words with Links

On my site I have the Catholic Encyclopedia. It has over 11,000 articles.

I'm interested in replacing words and phrases in the articles on my site with links to the relevant entries in the Catholic Encyclopedia. So, if someone writes:

St. Peter was the first pope.

it should replace "St. Peter" with a link to the article on St. Peter, and "pope" with a link to the article on the Pope.

I have it working, but it is very slow. There are over 30,000 possible replacements, so optimization is important. I'm just not sure where to go from here.

Here's my existing code. Note that it's using Drupal. Also, it replaces the words with a [cathenlink] tag, which is converted to a real HTML link later in the code.

function ce_execute_filter($text)
{
    // If text is empty, return as-is
    if (!$text) {
        return $text;
    }

    // Split by paragraph. The capture group is required for
    // PREG_SPLIT_DELIM_CAPTURE to actually return the newline runs,
    // so they can be re-appended to the output unchanged.
    $lines = preg_split('/(\n+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Contains the parsed and linked text
    $linked_text = '';

    foreach ($lines as $line) {

        // If this fragment is only one or more newline characters,
        // add it to $linked_text and continue without parsing
        if (preg_match('/^\n+$/', $line)) {
            $linked_text .= $line;
            continue;
        }

        // Select any terms that might be in this line,
        // ordered by descending length of term
        // so that the longest terms get replaced first
        $result = db_query('SELECT title, term FROM {catholic_encyclopedia_terms} ' .
                "WHERE :text LIKE CONCAT('%', term, '%') " .
                'GROUP BY term ' .
                'ORDER BY char_length(term) DESC',
                array(':text' => $line))
            ->fetchAll();

        // Array with lowercase term as key, title of entry as value
        $terms = array();

        // Array of the escaped terms only, in descending order of length
        $ordered_terms = array();

        foreach ($result as $r) {
            $terms[strtolower($r->term)] = $r->title;
            // Escape for use inside a /.../-delimited pattern
            $ordered_terms[] = preg_quote($r->term, '/');
        }

        // If no terms were returned, add the line and continue without parsing
        if (empty($ordered_terms)) {
            $linked_text .= $line;
            continue;
        }

        // Do the replace: build the regexp by joining $ordered_terms with |
        $line = preg_replace_callback(
            '/\b(' . implode('|', $ordered_terms) . ')\b/i',
            function ($matches) use ($terms) {
                return '[cathenlink=' . $terms[strtolower($matches[1])] . ']'
                    . $matches[1] . '[/cathenlink]';
            },
            $line);

        $linked_text .= $line;
    }

    return $linked_text;
}

I'm doing the preg_replace like this so that it doesn't replace a word twice. I would use strtr, but then there's no way to ensure it is a full word and not just part of a word.
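The word-boundary point can be seen in a tiny example (the term "pope" and the sample strings here are illustrative): strtr() replaces raw substrings, while the \b-anchored regex only matches whole words.

```php
<?php
// strtr() replaces bare substrings, so "popedom" gets mangled:
echo strtr('popedom', array('pope' => 'X'));                // Xdom

// The \b-anchored regex leaves partial words alone:
echo preg_replace('/\bpope\b/i', 'X', 'popedom');           // popedom
echo preg_replace('/\bpope\b/i', 'X', 'The pope.');         // The X.
```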

Is there any way to make this faster? Right now it is pretty slow.

I think the LIKE keyword is slowing you down. Is the term column indexed?

You can find some clues here.
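One caveat to the indexing suggestion: in this query the term column appears *inside* the LIKE pattern rather than on its left-hand side, so a B-tree index cannot be used and every call scans all ~30,000 rows. A hypothetical alternative (not from this thread, and at the cost of a large compiled pattern) is to fetch the term list once per request and do the matching in a single pass over the whole text in PHP. A sketch, where $terms maps lowercase term to entry title:

```php
<?php
// Hypothetical sketch: one regex pass over the whole text instead of
// one LIKE query per line. $terms maps lowercase term => entry title.
function ce_link_terms($text, array $terms)
{
    // Longest terms first, so "St. Peter" wins over a shorter "Peter".
    $keys = array_keys($terms);
    usort($keys, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Escape each term for use in the /.../-delimited pattern.
    $escaped = array_map(function ($t) {
        return preg_quote($t, '/');
    }, $keys);

    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';

    return preg_replace_callback($pattern, function ($m) use ($terms) {
        return '[cathenlink=' . $terms[strtolower($m[1])] . ']'
            . $m[1] . '[/cathenlink]';
    }, $text);
}
```

This trades per-line database time for one big regex compilation; whether that wins would need measuring against the real 30,000-term list.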

You could use an indexing system such as Lucene to index the Catholic Encyclopedia. I don't expect it changes very often, so the index could be updated on a daily basis. Lucene is written in Java, but I know that Zend has a PHP module that can read the index.

OK, I think the way I am doing it is probably the most efficient. What I came up with is to cache the results for one week, so that posts don't have to be parsed more than once per week. After implementing this, I've seen a marked improvement in speed on my site, so it seems to be working.
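A minimal sketch of that one-week cache in plain PHP. In Drupal the same idea would go through cache_get()/cache_set() with an expiry timestamp; the array-backed cache and the function names below are illustrative only.

```php
<?php
// Illustrative one-week cache around an expensive text filter.
// $cache persists between calls (in Drupal this would be a cache bin).
function ce_filter_cached($text, callable $filter, array &$cache, $now, $ttl = 604800)
{
    $cid = md5($text);

    // Serve the cached result while it is still fresh.
    if (isset($cache[$cid]) && $cache[$cid]['expires'] > $now) {
        return $cache[$cid]['data'];
    }

    // Otherwise run the expensive filter and remember it for a week.
    $result = $filter($text);
    $cache[$cid] = array('data' => $result, 'expires' => $now + $ttl);
    return $result;
}
```

The win is that the expensive filter (here, the 30,000-term replacement) runs at most once per post per week, no matter how often the post is viewed.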
