
Most frequent next word for every word in the input

You are given a set of sentences (words separated by spaces), for example:

{{My name is Mat},
 {This is easy},
 {Do you know where is Mat?}}

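You have to feed this input to an API function BuildModel() and train the system to output the most frequent next word for any given input word. This means that after BuildModel() has finished, you can call another API function GetNextWord('is') and get 'Mat' rather than 'easy', because 'Mat' appears 2 times after 'is'.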

BuildModel(vector<string> & v);
GetNextWord(string w);

Can we do this in linear time?

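Edit: I see this as scanning the words and building a map in which each word is associated with a max priority queue. The priority queue is filled with {word, frequency} pairs, so the word with the highest frequency is at the top. We can then simply call top() in GetNextWord. One problem with this is that a priority queue cannot easily be updated once we need to increment a frequency. So first build the count map for each word, then convert everything into queues. If we keep the map, the entries are ordered by key, not by frequency: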

map<string, int> MAP{ {"Mat",2}, {"easy",1} };
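The "count first, convert later" idea from the edit could be sketched as follows. This is only an illustration under assumptions: the names WordCount, ByCount, FollowerQueue and toQueue are hypothetical and do not appear in the question, and an unordered map is used for the counts.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::size_t>;

// Orders pairs by count, so the pair with the highest count ends up on top.
struct ByCount {
    bool operator()(const WordCount& a, const WordCount& b) const {
        return a.second < b.second;
    }
};

using FollowerQueue =
    std::priority_queue<WordCount, std::vector<WordCount>, ByCount>;

// Hypothetical helper: the counts are complete at this point, so the queue
// is built once from the whole range and never needs an in-place update.
FollowerQueue toQueue(const std::unordered_map<std::string, std::size_t>& counts) {
    return FollowerQueue(counts.begin(), counts.end());
}
```

Calling toQueue on the counts {"Mat": 2, "easy": 1} yields a queue whose top() is the pair {"Mat", 2}.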

My standard answer would be: use some kind of Trie. I think it was made for this purpose. But a trie has a huge memory consumption.

And, if the requirement is only to look at the next word, it can be done with a standard counting approach.

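First we convert everything to lowercase with a simple std::transform. Then I split the text into sentences with any suitable algorithm; here, for example, I use the approach with std::sregex_token_iterator. Next, I extract all words from a sentence with a simple std::istream_iterator. Good, now we have all the words of one sentence in a std::vector. Using its index operator [] we can easily access a word and the word following it, which has "index + 1".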
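The three preprocessing steps described above can be sketched as small helper functions. The function names toLower, splitSentences and splitWords are assumptions made for this illustration; the delimiter set "[.;:]" is the one used in the full program further down.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Lowercase a copy of the text. The lambda takes unsigned char because
// std::tolower is undefined for negative char values.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Split at '.', ';' or ':'. Submatch index -1 selects the text between matches.
std::vector<std::string> splitSentences(const std::string& text) {
    static const std::regex delimiter("[.;:]");
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), delimiter, -1),
        std::sregex_token_iterator());
}

// Split one sentence at whitespace.
std::vector<std::string> splitWords(const std::string& sentence) {
    std::istringstream iss(sentence);
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```

Note that splitting at whitespace keeps other punctuation attached to a word, so "Mat?" and "Mat" would be counted as different words unless such characters are stripped first.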

As the main data structure I use

std::unordered_map<std::string, std::unordered_map<std::string, size_t>>

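In the first key we store the "front" word. In the second, inner std::unordered_map we store all "follow" words of the given "front" word together with their counts. With that we can use the standard counting approach: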

for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
    followCounter[words[wordIndex]][words[wordIndex + 1]]++;

That is simple.
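If only the single most frequent follower is ever needed, the counting loop can also remember the current best follower while it counts, which avoids any sorting afterwards. The following is a minimal sketch of that variation, not the approach used in the rest of this answer. The class name NextWordModel, the return types and the tie-breaking rule (the follower that reaches the highest count first wins) are assumptions; only the names BuildModel and GetNextWord come from the question.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

class NextWordModel {
public:
    // Each element of `sentences` is one sentence of space-separated words.
    void BuildModel(const std::vector<std::string>& sentences) {
        for (const std::string& sentence : sentences) {
            std::istringstream iss(sentence);
            std::string previous, current;
            if (!(iss >> previous)) continue;  // sentence without any word
            while (iss >> current) {
                const std::size_t count = ++counts_[previous][current];
                Best& best = best_[previous];
                // Counts only grow, so comparing against the stored best
                // after each increment keeps it equal to the maximum.
                if (count > best.count) {
                    best.word = current;
                    best.count = count;
                }
                previous = current;
            }
        }
    }

    // Returns the most frequent follower, or an empty string if `word`
    // was never followed by another word.
    std::string GetNextWord(const std::string& word) const {
        const auto it = best_.find(word);
        return it == best_.end() ? std::string{} : it->second.word;
    }

private:
    struct Best {
        std::string word;
        std::size_t count = 0;
    };
    std::unordered_map<std::string, std::unordered_map<std::string, std::size_t>> counts_;
    std::unordered_map<std::string, Best> best_;
};
```

With the sentences "My name is Mat", "This is easy" and "Do you know where is Mat" (without the trailing question mark), GetNextWord("is") returns "Mat". Every word pair costs a constant number of hash map operations on average, so building the model takes expected linear time in the size of the input.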


Then, next comes the sorting.

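I will use a max heap based on std::vector. I do not need the std::priority_queue wrapper. So, in the first step, we use the range constructor of std::vector to copy the data from the inner std::unordered_map into a std::vector. This should be understandable.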

Then we use std::make_heap, supplying a custom comparison functor, to build the max heap.
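The copy and heapify steps can be sketched in isolation like this. The helper name buildMaxHeap is an assumption for the illustration; the aliases Pair, MaxHeap and Comp mirror the ones in the full program further down.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;
using MaxHeap = std::vector<Pair>;

// Compares by count only; with this ordering std::make_heap places the
// pair with the largest count at the front of the vector.
struct Comp {
    bool operator()(const Pair& a, const Pair& b) const {
        return a.second < b.second;
    }
};

// Hypothetical helper: copy the counts of one "front" word and heapify them.
MaxHeap buildMaxHeap(const std::unordered_map<std::string, std::size_t>& counter) {
    MaxHeap heap(counter.begin(), counter.end());      // range constructor
    std::make_heap(heap.begin(), heap.end(), Comp());  // linear in heap.size()
    return heap;
}
```

After the call, heap.front() holds the most frequent follow word; the remaining elements satisfy the heap property but are not fully sorted.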

Last but not least, we create a data structure

std::unordered_map<std::string, std::vector<std::pair<std::string, size_t>>>

The inner std::vector is the max heap. The outer key is the "front" word.

We move the max heap that we just created into this structure and get the final result.

That's it. We need roughly 13 statements to implement the equivalent of your "BuildModel" function.

Getting the "follow" word is also simple and fast. Please see the debug output.

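If we want to add a word later, we unfortunately need to write another 7 statements. We first get a reference to the max heap of the new "front" word. If it does not exist, the index operator [] of std::unordered_map creates a new entry for us and returns a reference to it. Its associated max heap will be empty.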

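Then we search for the "follow" word in the (already existing or just created) max heap. If it exists, we increment its counter. If not, we use the vector's push_back to add a new pair consisting of the "follow" word and a count of 1. Last but not least, we heapify it again, and that's it.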
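The update step can be sketched as a standalone function. The name addFollower is an assumption for this illustration. One detail deserves care: std::push_heap only repositions the last element of the range, so after an in-place increment of an element somewhere in the middle, the sketch rebuilds the heap with std::make_heap instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;
using MaxHeap = std::vector<Pair>;

struct Comp {
    bool operator()(const Pair& a, const Pair& b) const {
        return a.second < b.second;
    }
};

// Hypothetical helper: record one more occurrence of `followWord`.
void addFollower(MaxHeap& heap, const std::string& followWord) {
    const auto it = std::find_if(heap.begin(), heap.end(),
        [&](const Pair& p) { return p.first == followWord; });

    if (it != heap.end()) {
        ++it->second;
        // The changed element may sit anywhere in the heap, so rebuild it.
        std::make_heap(heap.begin(), heap.end(), Comp());
    } else {
        heap.emplace_back(followWord, 1);
        // A new last element is exactly the case std::push_heap handles.
        std::push_heap(heap.begin(), heap.end(), Comp());
    }
}
```

Both the linear search and the rebuild are linear in the number of distinct followers of one word, so a single update costs O(k) for a heap with k entries.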

The data structures are a bit hard to read. Therefore I added a number of using declarations (type aliases) to improve readability.

I tend to think that this is a reasonable implementation, but only a benchmark can give the final answer.

As I said at the beginning, a Trie will outperform anything here.

Anyway. Please see one possible solution to your problem.

#include <iostream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <string>
#include <regex>
#include <algorithm>
#include <utility>
#include <iomanip>
#include <iterator>
#include <cctype>

// Copyright: https://linguapress.com/intermediate/silicon-valley.htm
std::string text{ R"(The story of Silicon Valley. 
 If old America was made in New York or Detroit, modern America is made in Silicon Valley. But what is "Silicon Valley", where is it? And why is it where it is?
San Jose. San Jose,in the heart of Silicon Valley. It is not made of silicon; and it is not a river valley; but forgetting that, Silicon Valley is probably the most famous valley in the world. 
Although it is not the place where the first computer was built (that was Manchester, England), Silicon Valley, near San Francisco,  
was the birthplace of the modern computer industry. For this, we can say thankyou to scientists at the universities in California, and to the Hippies of the 1960's.
It was in the nineteen-sixties that American "youth culture" really began. California, of course, already existed; but the Sixties Generation rediscovered it.
At the time there were really two different forms of youth culture; the "Beach Boy" culture on the one hand, and the anti-establishment hippies and radical students 
on the other hand; and they all dreamed of California. For the Beach Boys, that meant southern California, where they could sing about surfing and cars; 
for the Hippies and radicals, it meant San Francisco, "flower power" and revolutionary new ideas. The campuses at Berkeley and Stamford, near San Francisco, were hot-beds 
of new ideas, new technology, new culture, and new ways of living. When they finished university, many of the best students did not look for jobs with big 
companies like Ford or Exxon. Instead they wanted to be free and run their own operations and stay in California, not far from San Francisco. Silicon Valley 
is thus a group of small towns, including Palo Alto and San Jose, a few miles south of San Francisco. The high-technology industry was already present around 
San Francisco. Intel had been founded in 1968, and in the same year the first computer mouse was built at Stamford University. In 1970, Xerox opened a research 
center in Palo Alto. There were also other electronics companies, like Hewlett Packard, and Fairchild, the world's first "semiconductor" company.
Then, in 1976, an electronics student called Steve Jobs started a small computer company in his garage; he gave it the same name as the Beatles' record company: Apple.
Very soon, more companies, like Seagate and Google appeared. "Silicon Valley" had arrived. There was even a sort of primitive Internet connecting many addresses 
in Silicon Valley, called the Arpanet. Today, Silicon Valley is still the home of the computer industry; it is still full of high technology, but it is not 
the only center for high-tech in the USA. Today here are computer firms all over the USA and all over the world; but Silicon Valley still has the largest 
concentration of high-tech companies and research centers. Microsoft, the world's biggest high-tech company, is not based in Silicon Valley. 
It is further north, near Seattle in the state of Washington.)" };

// Create aliases. Save typing work and make code more readable -------------------------
using Pair = std::pair<std::string, size_t>;
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
using FollowCounter = std::unordered_map<Pair::first_type, Counter>;
using MaxHeap = std::vector<Pair>;
using FollowMaxHeap = std::unordered_map <Pair::first_type, MaxHeap>;
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return p1.second < p2.second; }};
using SVec = std::vector<Pair::first_type>;
const std::regex reSentence("[.;:]");

// --------------------------------------------------------------------------------------
int main() {
    // Convert into all lowercase (unsigned char avoids undefined behavior in std::tolower)
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Split text into sentences
    SVec sentences(std::sregex_token_iterator(text.begin(), text.end(), reSentence, -1), {});

    // Here we will store, for each word, the counts of the words that directly follow it
    FollowCounter followCounter{};

    // Go over all sentences
    for (const Pair::first_type& sentence : sentences) {

        // Extract all words from this sentence
        std::istringstream iss(sentence);
        SVec words(std::istream_iterator<std::string>(iss), {});

        // And do the counting for this word and its follower.
        // "wordIndex + 1 < size" avoids unsigned wrap-around for an empty sentence
        for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
            followCounter[words[wordIndex]][words[wordIndex + 1]]++;
    }
    // Here we will store the sorted result.
    FollowMaxHeap followMaxHeap{};

    // Check all words and counts
    for (const auto& [word, counter] : followCounter) {

        // Copy all counts and create a Max Heap from them for each word
        MaxHeap maxHeap(counter.begin(), counter.end());
        std::make_heap(maxHeap.begin(), maxHeap.end(), Comp());
        // And store the result
        followMaxHeap[word] = std::move(maxHeap);
    }
    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    // Now add a new "follow" word
    std::string startWord{ "it" }, followWord{ "is" };

    // Get a reference to the Max Heap of the front word
    MaxHeap& mh = followMaxHeap[startWord];

    // Look, if the word exists in the MaxHeap
    MaxHeap::iterator fw = std::find_if(mh.begin(), mh.end(), [&](const Pair& p) { return p.first == followWord; });

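    // If yes, then increment its count and rebuild the heap, because the changed
    // element can sit anywhere. Otherwise, append a new pair and sift it up
    if (fw != mh.end()) {
        ++fw->second;
        std::make_heap(mh.begin(), mh.end(), Comp());
    }
    else {
        mh.push_back({ followWord, 1 });
        std::push_heap(mh.begin(), mh.end(), Comp());
    }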

    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    return 0;
}

