How do I get the most popular phrases from a large amount of text?

Question

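I'm setting up a Twitter-style "trending topics" box for my forum. I already have the most popular words, but I can't even begin to think how to get popular phrases the way Twitter does.

As it stands, I take the content of the last 200 posts, join it into one string, split that into words, and sort the words by how often each is used. How can I turn this from most popular words into most popular phrases?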

Answer 1

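One technique you might consider is using a ZSET (sorted set) in Redis for something like this. If you have a very large data set, you'll find you can do something like this: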

$words = explode(" ", $input); // Pseudo-code for breaking a block of data into individual words.
$word_count = count($words);

$r = new Redis(); // Owlient's PHPRedis PECL extension
$r->connect("127.0.0.1", 6379);

function process_phrase($phrase) {
    global $r;
    $phrase = implode(" ", $phrase);
    $r->zIncrBy("trending_phrases", 1, $phrase);
}

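// Count every contiguous run of words: start at word $i and take $j words.
// The inner bound is <= so that phrases ending on the last word are included.
for ($i = 0; $i < $word_count; $i++)
    for ($j = 1; $j <= $word_count - $i; $j++)
        process_phrase(array_slice($words, $i, $j));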

To retrieve the trending phrases, you can use:

// Assume $r is instantiated as above
$trending_phrases = $r->zRevRange("trending_phrases", 0, 9); // ranks 0 through 9, highest score first

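$trending_phrases will be an array of the top ten trending phrases. To track recent trending phrases (as opposed to a persistent, global set of phrases), duplicate all of the Redis interactions above. For each interaction, use a key that encodes, say, today's day number and tomorrow's day number (i.e., days since January 1, 1970). When retrieving the results into $trending_phrases, just retrieve both today's and tomorrow's (or yesterday's) keys and use array_merge and array_unique to find the union.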
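
The per-day key idea could be sketched roughly as follows. The key prefix "trending_phrases:", the helper names day_key and merge_trending, and the choice to read today's and yesterday's keys are assumptions made for illustration, not part of the answer itself. The Redis calls appear only in comments so the snippet runs without a server, and sample arrays stand in for the two reads.

```php
<?php
// Hypothetical helper: build a per-day key from a Unix timestamp,
// using the number of days since Jan 1, 1970 as the suffix.
function day_key(int $timestamp, string $prefix = "trending_phrases:"): string {
    return $prefix . intdiv($timestamp, 86400);
}

// Hypothetical helper: union of two phrase lists, keeping first-seen order.
function merge_trending(array $today, array $yesterday): array {
    return array_values(array_unique(array_merge($today, $yesterday)));
}

$now = time();
$today_key = day_key($now);
$yesterday_key = day_key($now - 86400);

// With a live connection (assumed to be $r, as in the code above), writes
// would target the day key, e.g. $r->zIncrBy($today_key, 1, $phrase), and
// each read would fetch one key, e.g. $r->zRevRange($today_key, 0, 9).
// The arrays below are made-up stand-ins for those two reads.
$today = array("redis zset", "trending topics", "php");
$yesterday = array("trending topics", "popular phrases");

print_r(merge_trending($today, $yesterday));
```

Note that a plain union like this keeps the phrases but not a combined ranking across the two days; if the order matters, the scores would have to be fetched and summed as well.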

Hope this helps!

Answer 2

Instead of splitting the text into individual words, split it into individual phrases. It's as simple as that.

$popular = array();

foreach ($tweets as $tweet)
{
    // split by common punctuation chars
    $sentences = preg_split('~[.!?]+~', $tweet);

    foreach ($sentences as $sentence)
    {
        $sentence = strtolower(trim($sentence)); // normalize sentences

        if ($sentence === '')
        {
            continue; // skip empty fragments left by trailing punctuation
        }

        if (isset($popular[$sentence]) === false)  
        //if (array_key_exists($sentence, $popular) === false)
        {
            $popular[$sentence] = 0;
        }

        $popular[$sentence]++;
    }
}

arsort($popular);

echo '<pre>';
print_r($popular);
echo '</pre>';

This will be much slower if you treat a phrase as an aggregate of n consecutive words.
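
For comparison, treating a phrase as n consecutive words might look something like the sketch below. The function name count_ngrams, the sample posts, and the ASCII-only tokenizer are assumptions for illustration; this is one possible reading of "n consecutive words", not code from the answer.

```php
<?php
// Hypothetical helper: count every run of $n consecutive words in each text.
function count_ngrams(array $texts, int $n): array {
    $counts = array();
    if ($n < 1) {
        return $counts; // nothing sensible to count
    }
    foreach ($texts as $text) {
        // Lowercase, then split on anything that is not an ASCII letter,
        // digit, or apostrophe (an ASCII-only simplification).
        $words = preg_split("~[^a-z0-9']+~", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $last_start = count($words) - $n;
        for ($i = 0; $i <= $last_start; $i++) {
            $phrase = implode(" ", array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first
    return $counts;
}

// Made-up sample posts.
$posts = array(
    "Redis sorted sets are great",
    "I love Redis sorted sets!",
    "Sorted sets in Redis",
);

// Show the three most frequent two-word phrases.
print_r(array_slice(count_ngrams($posts, 2), 0, 3, true));
```

Each text yields roughly one phrase per word for a fixed n, so the extra cost comes mainly from running this for several values of n and from the larger number of distinct keys to store and sort.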

Answer 3

I don't know what kind of answer you're looking for, but Laconica:

http://status.net/?source=laconica

is an open-source Twitter clone (a much simpler version).

Maybe you could use part of its code to build your own popular phrases?

Good luck!

3 answers

Answer 1: 2, accepted, 2010-10-14 04:13:23
Answer 2: 1, 2010-10-13 20:36:57
Answer 3: 1, 2010-10-14 04:20:06
