STL container for best performance?

Question

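I have a project where I need to read a text file and record the number of occurrences of every string, character, or number read until EOF.

I then need to print the top 10 most used words.

For example, the file would contain "This is a test for this project". I would read this and store each word in a container along with its current count.

Now, we are graded on time-complexity efficiency as the input grows. So I need some help choosing the most efficient STL container.

It seems that order does not matter, and I can always insert at the end and never need to insert anywhere else. However, I will have to search the container for the top 10 most used words. Which STL container has the best time complexity for these requirements?

Also, if you could explain your reasoning so that I know more for the future, that would be great!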

Answer 1

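I use two containers for tasks like this: std::unordered_map<std::string, int> to store the word frequencies, and std::map<int, std::string> to keep track of the most used words.

While updating the first map with a new word, you also update the second map. To keep it tidy, if the size of the second map exceeds 10, remove the least frequent word.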
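For readers who want to see the "update both maps, then trim" idea spelled out, here is a minimal sketch. The class name RunningTop and its members are hypothetical and are not taken from the answer. It also assumes the second map groups words by count in a std::set, so that words with equal counts do not overwrite each other. When several words tie for the lowest count, which one gets trimmed is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not from the answer): counts words and keeps a
// leaderboard of at most `limit` words, trimmed after every update.
class RunningTop {
public:
    explicit RunningTop(std::size_t limit) : limit_(limit) {
        assert(limit > 0 && "the leaderboard needs room for at least one word");
    }

    void add(const std::string& word) {
        int& count = counts_[word];  // value-initialized to 0 for a new word
        if (count > 0) {
            // Take the word out of its old bucket, if it is still tracked.
            auto bucket = board_.find(count);
            if (bucket != board_.end()) {
                tracked_ -= bucket->second.erase(word);
                if (bucket->second.empty()) board_.erase(bucket);
            }
        }
        ++count;
        board_[count].insert(word);
        ++tracked_;
        // Drop words from the lowest bucket until the board fits the limit.
        while (tracked_ > limit_) {
            auto lowest = board_.begin();
            lowest->second.erase(lowest->second.begin());
            if (lowest->second.empty()) board_.erase(lowest);
            --tracked_;
        }
    }

    // Tracked words, most frequent first; ties come out in alphabetical order.
    std::vector<std::pair<std::string, int>> top() const {
        std::vector<std::pair<std::string, int>> result;
        for (auto it = board_.rbegin(); it != board_.rend(); ++it)
            for (const auto& word : it->second)
                result.emplace_back(word, it->first);
        return result;
    }

private:
    std::size_t limit_;
    std::size_t tracked_ = 0;                       // words currently on the board
    std::unordered_map<std::string, int> counts_;   // word -> frequency
    std::map<int, std::set<std::string>> board_;    // frequency -> words
};
```

Calling add() for every word read and top() whenever a snapshot is wanted gives a running top list. Every update pays for an extra ordered-map lookup and possibly an erase, so this is only worth considering when the list has to be available while the input is still being read.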

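Update

In response to the comments below, I ran some benchmarks.

First, @PaulMcKenzie - you are right: to handle ties, I need std::map<int, std::set<std::string>> (which became obvious as soon as I started implementing it).

Second, @dratenik - you turned out to be right as well. While constantly trimming the frequency map keeps it small, the overhead outweighs the benefit. Also, this is only needed if the client wants to see "running totals" (as was required in my project). It makes no sense at all in post-processing, when all the words have already been loaded.

For the test I used alice29.txt (available online), preprocessed: I removed all punctuation and converted it to upper case. Here is my code: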

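#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

int main()
{
  auto t1 = std::chrono::high_resolution_clock::now();
  std::ifstream src("c:\\temp\\alice29-clean.txt");
  std::string str;
  std::unordered_map<std::string, int> words;      // word -> frequency
  std::map<int, std::set<std::string>> freq;       // frequency -> words (handles ties)
  int i(0);
  // Count every whitespace-separated word
  while (src >> str)
  {
    words[str]++;
    i++;
  }
  // Group the words by frequency once all input has been read
  for (auto& w : words)
  {
    freq[w.second].insert(w.first);
  }
  int count(0);
  // Walk the frequencies from highest to lowest
  for (auto it = freq.rbegin(); it != freq.rend(); ++it)
  {
    for (auto& w : it->second)
    {
      std::cout << w << " - " << it->first << std::endl;
      ++count;
    }
    // Tied words are printed together, so slightly more than 10 lines may appear
    if (count >= 10)
      break;
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
  // The exit code is the total word count (the platform may truncate it)
  return i;
}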

Answer 2

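Suppose you have decided to use std::unordered_map<std::string, int> to get the frequency counts of the items. That is a good start, but the other part of the problem that needs solving is getting the top 10 items.

Whenever a question asks for "the top N" or "the smallest N" or similar, there are several ways to get this information.

One way is to sort the data and take the first N items. Using std::sort or another good sorting routine, the time complexity of that operation should be O(M*log(M)), where M is the total number of items being sorted.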
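As a rough illustration of the sorting approach, the sketch below copies the map entries into a vector, sorts them by descending count, and keeps the first n. The function name top_n_by_sort is made up for this example, and the alphabetical tie-break is an assumption added only to make the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns the n most frequent entries, highest count first.
std::vector<std::pair<std::string, int>>
top_n_by_sort(const std::unordered_map<std::string, int>& counts, std::size_t n)
{
    // Copy the entries out, because an unordered_map cannot be sorted in place.
    std::vector<std::pair<std::string, int>> entries(counts.begin(), counts.end());

    // Sort by descending count; break ties alphabetically (an assumption).
    std::sort(entries.begin(), entries.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });

    // Keep only the first n entries (fewer if the map is smaller than n).
    if (entries.size() > n) entries.resize(n);
    assert(entries.size() <= n);
    return entries;
}
```

This sorts every distinct word even though only n of them are wanted, which is the cost the heap approach tries to avoid.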

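Another way is to use a min-heap or max-heap of N items, depending on whether you want the top N or the bottom N, respectively.

Suppose you have working code that uses unordered_map to get the frequency counts. Here is a routine that uses the STL heap functions to get the top N items. It has not been fully tested, but it should demonstrate how to work with the heap.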

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

void print_top_n(const std::unordered_map<std::string, int>& theMap, std::size_t n)
{
    // Nothing to print, and vHeap.front() below must never see an empty heap
    if (n == 0)
        return;

    // This vector is the heap
    std::vector<std::pair<std::string, int>> vHeap;

    // Comparing with > makes the standard heap functions maintain a min-heap on the count
    auto heapfn =
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) -> bool
    { return p1.second > p2.second; };

    // Go through each entry in the map
    for (auto& m : theMap)
    {
        if (vHeap.size() < n)
        {
            // Add item to the vector, since we haven't reached n items yet
            vHeap.push_back(m);

            // If we have reached n items, now is the time to build the heap
            if (vHeap.size() == n)
                // Make the min-heap of the n elements
                std::make_heap(vHeap.begin(), vHeap.end(), heapfn);
        }
        // Heap has been built. Check if the next element is at least as large as
        // the top of the heap, which holds the smallest count kept so far
        else if (vHeap.front().second <= m.second)
        {
            // Move the front of the heap to the end of the vector
            std::pop_heap(vHeap.begin(), vHeap.end(), heapfn);
            // Get rid of that item now
            vHeap.pop_back();
            // Add the new item
            vHeap.push_back(m);
            // Restore the heap property
            std::push_heap(vHeap.begin(), vHeap.end(), heapfn);
        }
    }

    // If the map held fewer than n entries, the heap was never built
    if (vHeap.size() < n)
        std::make_heap(vHeap.begin(), vHeap.end(), heapfn);

    // Sort the heap; with the > comparator this yields descending counts
    std::sort_heap(vHeap.begin(), vHeap.end(), heapfn);

    // Output the results
    for (auto& v : vHeap)
        std::cout << v.first << " " << v.second << "\n";
}

int main()
{
    std::unordered_map<std::string, int> test = { {"abc", 10},
        { "123",5 },
        { "456",1 },
        { "xyz",15 },
        { "go",8 },
        { "text1",7 },
        { "text2",17 },
        { "text3",27 },
        { "text4",37 },
        { "text5",47 },
        { "text6",9 },
        { "text7",7 },
        { "text8", 22 },
        { "text9", 8 },
        { "text10", 2 } };
    print_top_n(test, 10);
}

Output:

text5 47
text4 37
text3 27
text8 22
text2 17
xyz 15
abc 10
text6 9
text9 8
go 8

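The advantages of using the heap are:

Each heap operation has a complexity of O(log(N)) for a heap of N items, so processing all M map entries costs at most O(M*log(N)), rather than the usual O(M*log(M)) that a sorting routine over all the entries would give you.
Note that we only need to re-heapify when we detect that the top item of the min-heap is going to be discarded.
We do not need to store an entire (multi)map of frequency count to string in addition to the original map of string to frequency count.
The heap stores only N elements, regardless of how many items are in the original map.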
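The same bounded min-heap can also be written with std::priority_queue instead of the raw heap functions, which makes the "never more than N elements" property easy to see. This is only a sketch: the function name top_n_by_heap and the Entry alias are hypothetical, and putting the count first in the pair is a choice made here so that std::greater orders entries by count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count first, so the default pair comparison orders entries by count.
using Entry = std::pair<int, std::string>;

// Hypothetical helper: returns the n most frequent entries, highest count first.
std::vector<Entry>
top_n_by_heap(const std::unordered_map<std::string, int>& counts, std::size_t n)
{
    // std::greater turns priority_queue into a min-heap: top() is the smallest entry.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (const auto& kv : counts) {
        if (heap.size() < n) {
            heap.emplace(kv.second, kv.first);
        } else if (n > 0 && heap.top().first < kv.second) {
            // The new entry beats the smallest one kept so far: swap them.
            heap.pop();
            heap.emplace(kv.second, kv.first);
        }
        assert(heap.size() <= n);  // the heap never outgrows n
    }

    // Popping a min-heap yields ascending order, so reverse at the end.
    std::vector<Entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

With tied counts at the cut-off, which of the tied words survives depends on the iteration order of the unordered_map, so the result is only well defined up to ties.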

Answer 1
0 2020-11-18 19:27:19

Answer 2
0 2020-11-18 20:34:32
