
Most frequent next word for every word in the input

You are given a set of sentences (words separated by spaces), for example:

{{My name is Mat},
 {This is easy},
 {Do you know where is Mat?}}

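You have to feed this input into an API function BuildModel() and train the system to output the most frequent next word for any given input word. That means that after BuildModel() has finished, you can call another API function, GetNextWord('is'), and get 'Mat' rather than 'easy', because 'Mat' follows 'is' twice.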

BuildModel(vector<string> & v);
GetNextWord(string w);

Can we do this in linear time?

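Edit: I think of it as scanning the words and building a map in which every word is associated with a max priority queue. The priority queue is filled with {word, frequency} pairs, so the most frequent word sits at the top, and GetNextWord can simply call top(). One problem is that a priority queue cannot easily be updated once a frequency has to be incremented. So first build a map for every word, then convert everything into queues. If we keep the map, the order of the output is determined by the key: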

map<string, int> MAP{ {"Mat",2}, {"easy",1} };

My standard answer would be: use some kind of trie. I think it was made for this purpose. But a trie has a huge memory consumption.

And if the requirement is only to look up the next word, this can be done with a standard counting approach.

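First, we convert everything to lowercase with a simple std::transform. Then I use any algorithm to split the text into sentences; here, as an example, I use the approach with std::sregex_token_iterator. Next, I use a simple std::istream_iterator to extract all words from a sentence. Good, now we have all words of one sentence in a std::vector. With its index operator [] we can easily access a word and the word that follows it, which sits at "index + 1".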
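The preprocessing steps just described could be sketched roughly as follows. The helper name splitIntoSentences is hypothetical and not part of the original answer, and the delimiter set "[.;:]" is simply the one used in the full program further down.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lowercase the text, split it into sentences at '.', ';'
// or ':', and split every sentence into whitespace-separated words.
std::vector<std::vector<std::string>> splitIntoSentences(std::string text) {
    // std::tolower expects a value representable as unsigned char
    std::transform(text.begin(), text.end(), text.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::regex reSentence("[.;:]");
    std::vector<std::vector<std::string>> result;

    // Submatch index -1 selects the text between the delimiter matches
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(text.begin(), text.end(), reSentence, -1); it != end; ++it) {
        std::istringstream iss(it->str());
        std::istream_iterator<std::string> first(iss), last;
        std::vector<std::string> words(first, last);
        if (!words.empty())              // skip pieces that contain no words
            result.push_back(std::move(words));
    }
    return result;
}
```

Pieces without any word are dropped here, so later stages never see an empty word list.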

As the main data structure I use

std::unordered_map<std::string, std::unordered_map<std::string, size_t>>

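In the first key, we store the "front" word. In the second, inner std::unordered_map, we store all "follow" words for the given "front" word together with their counts. With that, we can use the standard counting approach: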

for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
    followCounter[words[wordIndex]][words[wordIndex+1]]++;

This is quite simple.
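As a small, self-contained illustration of this counting step, the sketch below wraps the loop in a function. The name countFollowers is an assumption made for this example. The loop condition is written as "wordIndex + 1 < words.size()", because "words.size() - 1" is an unsigned expression and wraps around for an empty word list.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Counter = std::unordered_map<std::string, std::size_t>;
using FollowCounter = std::unordered_map<std::string, Counter>;

// Hypothetical helper: count, for every word, how often each other word
// directly follows it within one sentence.
void countFollowers(const std::vector<std::string>& words, FollowCounter& followCounter) {
    // Safe for empty and one-word sentences: the body is simply never entered
    for (std::size_t wordIndex = 0; wordIndex + 1 < words.size(); ++wordIndex)
        ++followCounter[words[wordIndex]][words[wordIndex + 1]];
}
```

Fed with the three (lowercased, punctuation-free) sentences from the question, followCounter["is"] would hold a count of 2 for "mat" and 1 for "easy".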


Next comes the sorting.

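I will use a max heap based on a std::vector. I do not need the std::priority_queue wrapper. So, in the first step, we copy the data from the inner std::unordered_map into a std::vector by using the range constructor of std::vector. That should be easy to follow.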

Then we build the max heap with std::make_heap, providing a custom comparison functor.

Last but not least, we create a data structure

std::unordered_map<std::string, std::vector<std::pair<std::string, size_t>>>

The inner std::vector is the max heap. The outer key is the "front" word.

We move the max heap just created into this structure and get the final result.
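The conversion from one inner counter to a max heap might look like the following sketch. The names toMaxHeap and ByCount are hypothetical; the comparison looks at the count only, so the order among words with equal counts is unspecified.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;
using Counter = std::unordered_map<std::string, std::size_t>;
using MaxHeap = std::vector<Pair>;

// Compare by count only, so the pair with the highest count ends up at the front
struct ByCount {
    bool operator()(const Pair& a, const Pair& b) const { return a.second < b.second; }
};

// Hypothetical helper: copy all {word, count} pairs and arrange them as a max heap
MaxHeap toMaxHeap(const Counter& counter) {
    MaxHeap heap(counter.begin(), counter.end());        // range constructor copies the pairs
    std::make_heap(heap.begin(), heap.end(), ByCount{}); // O(n) heap construction
    return heap;
}
```

After the call, heap.front() is a pair with the largest count, which is all that the lookup needs.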

That's it. We need roughly 13 statements to implement the equivalent of your "BuildModel" function.

Getting the "follow" word is also simple and fast. Please see the debug output.
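For illustration, a lookup on top of this structure could be as short as the sketch below. The function getNextWord and its std::optional return type (C++17) are assumptions of this example; the original answer only prints the front element of each heap.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;
using MaxHeap = std::vector<Pair>;
using FollowMaxHeap = std::unordered_map<std::string, MaxHeap>;

// Hypothetical lookup: return the most frequent follower of "word",
// or std::nullopt if the word was never seen as a "front" word.
std::optional<std::string> getNextWord(const FollowMaxHeap& model, const std::string& word) {
    const auto it = model.find(word);
    if (it == model.end() || it->second.empty())
        return std::nullopt;
    return it->second.front().first;   // front of a max heap is a maximum element
}
```

The lookup is one hash table access plus reading the first vector element, so it takes constant time on average.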

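If we want to add words later, we unfortunately need to write another 7 statements. We first get a reference to the max heap of the new "front" word. If that word does not exist yet, the index operator [] of std::unordered_map creates a new entry for us and returns a reference to it. Its associated max heap will be empty.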

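Then we search for the "follow" word in the (already existing or just created) max heap. If it exists, we increment its counter. If not, we append a new pair of the "follow" word and a count of 1 with the vector's push_back function. Last but not least, we re-establish the heap property, and that's it.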
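A minimal sketch of this update step is given below; addFollower and ByCount are hypothetical names. It rebuilds the heap with std::make_heap after every change. std::push_heap is only specified for the case where the new element is the last one in the range, so it would not restore the heap after a count somewhere inside the vector has been incremented.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::size_t>;
using MaxHeap = std::vector<Pair>;

struct ByCount {
    bool operator()(const Pair& a, const Pair& b) const { return a.second < b.second; }
};

// Hypothetical helper: register one more occurrence of followWord in the heap
void addFollower(MaxHeap& heap, const std::string& followWord) {
    // Linear search, because a heap is not ordered by word
    const auto it = std::find_if(heap.begin(), heap.end(),
        [&](const Pair& p) { return p.first == followWord; });

    if (it != heap.end())
        ++it->second;                      // known follower: increment its count
    else
        heap.emplace_back(followWord, 1);  // new follower: start with count 1

    // Rebuild the heap; O(n), but valid wherever the change happened
    std::make_heap(heap.begin(), heap.end(), ByCount{});
}
```

Both the search and the rebuild are linear in the number of followers of one word, so an update costs O(n) for that word only.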

The data structures are a little hard to read. Therefore I added a lot of using statements to increase readability.

I tend to think that this is a reasonable implementation, but only a benchmark can give the final answer.

As I said at the beginning, a trie will outperform everything here.

Anyway, please see one of many possible solutions for your problem.

#include <iostream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <string>
#include <regex>
#include <algorithm>
#include <utility>
#include <iomanip>
#include <iterator>   // std::istream_iterator
#include <cctype>     // std::tolower

// Copyright: https://linguapress.com/intermediate/silicon-valley.htm
std::string text{ R"(The story of Silicon Valley. 
 If old America was made in New York or Detroit, modern America is made in Silicon Valley. But what is "Silicon Valley", where is it? And why is it where it is?
San Jose. San Jose,in the heart of Silicon Valley. It is not made of silicon; and it is not a river valley; but forgetting that, Silicon Valley is probably the most famous valley in the world. 
Although it is not the place where the first computer was built (that was Manchester, England), Silicon Valley, near San Francisco,  
was the birthplace of the modern computer industry. For this, we can say thankyou to scientists at the universities in California, and to the Hippies of the 1960's.
It was in the nineteen-sixties that American "youth culture" really began. California, of course, already existed; but the Sixties Generation rediscovered it.
At the time there were really two different forms of youth culture; the "Beach Boy" culture on the one hand, and the anti-establishment hippies and radical students 
on the other hand; and they all dreamed of California. For the Beach Boys, that meant southern California, where they could sing about surfing and cars; 
for the Hippies and radicals, it meant San Francisco, "flower power" and revolutionary new ideas. The campuses at Berkeley and Stamford, near San Francisco, were hot-beds 
of new ideas, new technology, new culture, and new ways of living. When they finished university, many of the best students did not look for jobs with big 
companies like Ford or Exxon. Instead they wanted to be free and run their own operations and stay in California, not far from San Francisco. Silicon Valley 
is thus a group of small towns, including Palo Alto and San Jose, a few miles south of San Francisco. The high-technology industry was already present around 
San Francisco. Intel had been founded in 1968, and in the same year the first computer mouse was built at Stamford University. In 1970, Xerox opened a research 
center in Palo Alto. There were also other electronics companies, like Hewlett Packard, and Fairchild, the world's first "semiconductor" company.
Then, in 1976, an electronics student called Steve Jobs started a small computer company in his garage; he gave it the same name as the Beatles' record company: Apple.
Very soon, more companies, like Seagate and Google appeared. "Silicon Valley" had arrived. There was even a sort of primitive Internet connecting many addresses 
in Silicon Valley, called the Arpanet. Today, Silicon Valley is still the home of the computer industry; it is still full of high technology, but it is not 
the only center for high-tech in the USA. Today here are computer firms all over the USA and all over the world; but Silicon Valley still has the largest 
concentration of high-tech companies and research centers. Microsoft, the world's biggest high-tech company, is not based in Silicon Valley. 
It is further north, near Seattle in the state of Washington.)" };

// Create aliases. Save typing work and make code more readable -------------------------
using Pair = std::pair<std::string, size_t>;
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
using FollowCounter = std::unordered_map<Pair::first_type, Counter>;
using MaxHeap = std::vector<Pair>;
using FollowMaxHeap = std::unordered_map <Pair::first_type, MaxHeap>;
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return p1.second < p2.second; }};
using SVec = std::vector<Pair::first_type>;
const std::regex reSentence("[.;:]");

// --------------------------------------------------------------------------------------
int main() {
    // Convert into all lowercase
    std::transform(text.begin(), text.end(), text.begin(),[](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Split text into sentences
    SVec sentences(std::sregex_token_iterator(text.begin(), text.end(), reSentence, -1), {});

    // Here we will store the count of words that always follow a specific other word
    FollowCounter followCounter{};

    // Go over all sentences
    for (const Pair::first_type& sentence : sentences) {

        // Extract all words from this sentence
        std::istringstream iss(sentence);
        SVec words(std::istream_iterator<std::string>(iss), {});

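        // And do the counting for this word and its follower
        // ("wordIndex + 1 < size" avoids the unsigned wrap-around of "size - 1" when a sentence has no words)
        for (size_t wordIndex{}; wordIndex + 1 < words.size(); ++wordIndex)
            followCounter[words[wordIndex]][words[wordIndex+1]]++;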
    }
    // Here we will store the sorted result.
    FollowMaxHeap followMaxHeap{};

    // Check all words and counts
    for (const auto& [word, counter] : followCounter) {

        // Copy all counts and create a Max Heap from them for each word
        MaxHeap maxHeap(counter.begin(), counter.end());
        std::make_heap(maxHeap.begin(), maxHeap.end(), Comp());
        // And store the result
        followMaxHeap[word] = std::move(maxHeap);
    }
    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    // Now add a new "follow" word
    std::string startWord{ "it" }, followWord{ "is" };

    // Get a reference to the Max Heap of the front word
    MaxHeap& mh = followMaxHeap[startWord];

    // Look, if the word exists in the MaxHeap
    MaxHeap::iterator fw = std::find_if(mh.begin(), mh.end(), [&](const Pair& p) { return p.first == followWord; });

    // If yes, then increment its count, otherwise, add a new one
    if (fw != mh.end())  
        ++fw->second;
    else
        mh.push_back({ followWord , 1});
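    // Heapify again. std::make_heap rebuilds the whole heap, which is also
    // correct when a count somewhere inside the heap was incremented
    // (std::push_heap only handles a new element appended at the back).
    std::make_heap(mh.begin(), mh.end(), Comp());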

    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    return 0;
}
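
Regarding the linear-time part of the question: the following is a minimal sketch of a different variant, not the approach of the program above. It keeps the counting maps but, instead of building heaps, remembers the best follower seen so far for every word while counting. The class NextWordModel and its member names are hypothetical. The sketch does no case folding or punctuation handling, and it assumes average constant-time hashing, under which building the model is expected to be linear in the total number of words.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class wrapping the two API functions from the question
class NextWordModel {
public:
    // Each element of "sentences" is one sentence of space-separated words
    void buildModel(const std::vector<std::string>& sentences) {
        for (const std::string& sentence : sentences) {
            std::istringstream iss(sentence);
            std::string previous, current;
            if (!(iss >> previous))
                continue;                       // sentence without words
            while (iss >> current) {
                const std::size_t count = ++counts_[previous][current];
                Best& best = best_[previous];
                if (count > best.count) {       // strict '>': on ties the earlier leader stays
                    best.word = current;
                    best.count = count;
                }
                previous = current;
            }
        }
    }

    // Returns an empty string if "word" was never followed by another word
    std::string getNextWord(const std::string& word) const {
        const auto it = best_.find(word);
        return it == best_.end() ? std::string{} : it->second.word;
    }

private:
    struct Best { std::string word; std::size_t count = 0; };
    std::unordered_map<std::string, std::unordered_map<std::string, std::size_t>> counts_;
    std::unordered_map<std::string, Best> best_;
};
```

Because counts only ever grow, the remembered leader stays valid after later insertions, so words can be added at any time without re-sorting. The price is a second map that stores one extra entry per "front" word.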

