
Most frequent next word for every word in the input

You are given a collection of sentences (words separated by whitespace), for example:

{{My name is Mat},
 {This is easy},
 {Do you know where is Mat?}}

You have to feed this input into an API function BuildModel() and train the system to output the most frequent next word for any given input word. This means that after BuildModel() has finished running, you may call another API function GetNextWord('is') and get 'Mat' instead of 'easy'. As you can see, 'Mat' appears 2 times after 'is'.

BuildModel(vector<string> & v);
GetNextWord(string w);

Can we do it in linear time?

EDIT: I see it as scanning the words and building a map where each word is associated with a max priority queue. The priority queue is filled with {word, frequency} pairs so that the word with the highest frequency is on top, and we can simply call top() in GetNextWord. One problem with this is that a priority queue cannot easily be updated once we need to increment a frequency. So first construct a map for each word and then transform everything into a queue. If we keep the map, the output is defined by the key

map<string, int> MAP{ {"Mat",2}, {"easy",1} };
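The idea above (count followers per word, answer defined by the maximum count) can be sketched end to end in linear time, without any priority queue, by taking the arg-max in a second pass over the distinct bigrams. This is only a sketch; the function signatures mirror the question's API but are assumptions:

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// bigram counts: counts[front][follow] = occurrences
static std::unordered_map<std::string, std::unordered_map<std::string, int>> counts;
// best[front] = most frequent follower of front
static std::unordered_map<std::string, std::string> best;

void BuildModel(const std::vector<std::string>& v) {
    counts.clear();
    best.clear();
    // Pass 1: count every adjacent word pair within each sentence
    for (const std::string& sentence : v) {
        std::istringstream iss(sentence);
        std::string prev, cur;
        if (iss >> prev)
            while (iss >> cur) { ++counts[prev][cur]; prev = cur; }
    }
    // Pass 2: record the arg-max follower per front word
    for (const auto& [word, followers] : counts) {
        int maxCount = 0;
        for (const auto& [next, c] : followers)
            if (c > maxCount) { maxCount = c; best[word] = next; }
    }
}

std::string GetNextWord(const std::string& w) {
    auto it = best.find(w);
    return it == best.end() ? std::string{} : it->second;
}
```

Both passes touch each bigram a constant number of times, so the whole build is linear in the input size, and each GetNextWord lookup is O(1) on average.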

My standard answer would be: use some kind of trie. I think it was made for this purpose. But a trie has huge memory consumption.

And, if the requirement is just to look at the next word, then this can be done with a standard counting approach.

First we make everything lowercase with a simple std::transform. Then I split the text into sentences; here, for example, I use the method with a std::sregex_token_iterator. Next, I extract all words from a sentence with a simple std::istream_iterator. Good, now we have all words of a sentence in a std::vector. Using its index operator [] we can easily access a word and its following word at "index + 1".
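The splitting step can be isolated into a small helper. This is just a sketch of the same std::sregex_token_iterator technique used in the full program below (the function name is a placeholder):

```cpp
#include <regex>
#include <string>
#include <vector>

// Split on sentence-ending punctuation; the -1 submatch index selects
// the non-matched text *between* the delimiters.
std::vector<std::string> splitSentences(const std::string& text) {
    static const std::regex re("[.;:]");
    return { std::sregex_token_iterator(text.begin(), text.end(), re, -1),
             std::sregex_token_iterator{} };
}
```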

As the main data structure I use a

std::unordered_map<std::string, std::unordered_map<std::string, size_t>>

Under the first key, we store the "front" words. In the second, inner std::unordered_map, we store all "follow" words for the given "front" word together with their counts. With that we can use the standard counting approach:

for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
    followCounter[words[wordIndex]][words[wordIndex + 1]]++;

This is very simple.


Then comes sorting.

I will use a Max Heap based on a std::vector; I do not need the std::priority_queue wrapper. So, in the first step, we copy the data from the inner std::unordered_map to the std::vector using the std::vector's range constructor. That should be understandable.

Then we use std::make_heap to build a Max Heap by supplying a custom sorting functor.
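In isolation, that step looks like the following sketch (the helper name is a placeholder; the comparator matches the one in the full program):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;

// Turn a flat vector of {word, count} pairs into a Max Heap ordered by
// count, so the most frequent pair ends up at front().
std::vector<Pair> buildMaxHeap(std::vector<Pair> pairs) {
    auto byCount = [](const Pair& a, const Pair& b) { return a.second < b.second; };
    std::make_heap(pairs.begin(), pairs.end(), byCount);
    return pairs;
}
```

Note that std::make_heap arranges the vector as a heap in place; only front() is guaranteed to be the maximum, the rest is in heap order, not sorted order.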

And last but not least, we create a data structure

std::unordered_map<std::string, std::vector<std::pair<std::string, size_t>>>

The inner std::vector is the Max Heap. The outer key is the "front" word.

We move the just-created Max Heap into this structure to get the final result.

And that's it. We need roughly 13 statements for your "BuildModel"-equivalent functionality.

Getting a "follow" word is also simple and fast. Please look at the debug output.

If we want to add words later, we unfortunately need to write another 7 statements. We first get a reference to the Max Heap of the new "front" word. If it does not exist, the std::unordered_map's index operator [] will create a new entry for us and return a reference to it. Its associated Max Heap will be empty.

Then we search the (either already existing or just created) Max Heap for the "follow" word. If it exists, we increment its counter; if not, we push_back a new pair of the "follow" word with a count of 1. Last but not least, we heapify again, and that's it.

The data structures are a little hard to read. Therefore I added a lot of using statements to increase readability.

I tend to think that this is a reasonable implementation, but only a benchmark will give the final answer.

As I said in the beginning, a Trie will outperform anything here.

Anyway, please see one potential solution for your problem.

#include <iostream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <string>
#include <regex>
#include <algorithm>
#include <utility>
#include <iomanip>
#include <iterator>  // std::istream_iterator, std::sregex_token_iterator
#include <cctype>    // std::tolower

// Copyright: https://linguapress.com/intermediate/silicon-valley.htm
std::string text{ R"(The story of Silicon Valley. 
 If old America was made in New York or Detroit, modern America is made in Silicon Valley. But what is "Silicon Valley", where is it? And why is it where it is?
San Jose. San Jose,in the heart of Silicon Valley. It is not made of silicon; and it is not a river valley; but forgetting that, Silicon Valley is probably the most famous valley in the world. 
Although it is not the place where the first computer was built (that was Manchester, England), Silicon Valley, near San Francisco,  
was the birthplace of the modern computer industry. For this, we can say thankyou to scientists at the universities in California, and to the Hippies of the 1960's.
It was in the nineteen-sixties that American "youth culture" really began. California, of course, already existed; but the Sixties Generation rediscovered it.
At the time there were really two different forms of youth culture; the "Beach Boy" culture on the one hand, and the anti-establishment hippies and radical students 
on the other hand; and they all dreamed of California. For the Beach Boys, that meant southern California, where they could sing about surfing and cars; 
for the Hippies and radicals, it meant San Francisco, "flower power" and revolutionary new ideas. The campuses at Berkeley and Stamford, near San Francisco, were hot-beds 
of new ideas, new technology, new culture, and new ways of living. When they finished university, many of the best students did not look for jobs with big 
companies like Ford or Exxon. Instead they wanted to be free and run their own operations and stay in California, not far from San Francisco. Silicon Valley 
is thus a group of small towns, including Palo Alto and San Jose, a few miles south of San Francisco. The high-technology industry was already present around 
San Francisco. Intel had been founded in 1968, and in the same year the first computer mouse was built at Stamford University. In 1970, Xerox opened a research 
center in Palo Alto. There were also other electronics companies, like Hewlett Packard, and Fairchild, the world's first "semiconductor" company.
Then, in 1976, an electronics student called Steve Jobs started a small computer company in his garage; he gave it the same name as the Beatles' record company: Apple.
Very soon, more companies, like Seagate and Google appeared. "Silicon Valley" had arrived. There was even a sort of primitive Internet connecting many addresses 
in Silicon Valley, called the Arpanet. Today, Silicon Valley is still the home of the computer industry; it is still full of high technology, but it is not 
the only center for high-tech in the USA. Today here are computer firms all over the USA and all over the world; but Silicon Valley still has the largest 
concentration of high-tech companies and research centers. Microsoft, the world's biggest high-tech company, is not based in Silicon Valley. 
It is further north, near Seattle in the state of Washington.)" };

// Create aliases. Save typing work and make code more readable -------------------------
using Pair = std::pair<std::string, size_t>;
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
using FollowCounter = std::unordered_map<Pair::first_type, Counter>;
using MaxHeap = std::vector<Pair>;
using FollowMaxHeap = std::unordered_map <Pair::first_type, MaxHeap>;
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return p1.second < p2.second; }};
using SVec = std::vector<Pair::first_type>;
const std::regex reSentence("[.;:]");

// --------------------------------------------------------------------------------------
int main() {
    // Convert into all lowercase
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Split text into sentences
    SVec sentences(std::sregex_token_iterator(text.begin(), text.end(), reSentence, -1), {});

    // Here we will store the count of words that always follow a specific other word
    FollowCounter followCounter{};

    // Go over all sentences
    for (const Pair::first_type& sentence : sentences) {

        // Extract all words from this sentence
        std::istringstream iss(sentence);
        SVec words(std::istream_iterator<std::string>(iss), {});

        // And do the counting for this word and its follower.
        // (wordIndex + 1 < size avoids unsigned wrap-around for empty sentences)
        for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
            followCounter[words[wordIndex]][words[wordIndex + 1]]++;
    }
    // Here we will store the sorted result.
    FollowMaxHeap followMaxHeap{};

    // Check all words and counts
    for (const auto& [word, counter] : followCounter) {

        // Copy all counts and create a Max Heap from them for each word
        MaxHeap maxHeap(counter.begin(), counter.end());
        std::make_heap(maxHeap.begin(), maxHeap.end(), Comp());
        // And store the result
        followMaxHeap[word] = std::move(maxHeap);
    }
    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    // Now add a new "follow" word
    std::string startWord{ "it" }, followWord{ "is" };

    // Get a reference to the Max Heap of the front word
    MaxHeap& mh = followMaxHeap[startWord];

    // Check whether the word already exists in the Max Heap
    MaxHeap::iterator fw = std::find_if(mh.begin(), mh.end(), [&](const Pair& p) { return p.first == followWord; });

    // If yes, increment its count and rebuild the heap, because an increment
    // in the middle of the vector can violate the heap property (push_heap
    // only fixes up a newly appended last element)
    if (fw != mh.end()) {
        ++fw->second;
        std::make_heap(mh.begin(), mh.end(), Comp());
    }
    else {
        // Otherwise append a new pair and sift it into place
        mh.push_back({ followWord, 1 });
        std::push_heap(mh.begin(), mh.end(), Comp());
    }

    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    return 0;
}
