
How can I easily and quickly store a big word database?

I am currently working on a school project: developing a spelling checker in C++. For the part that checks whether a word exists, I currently do the following:

  1. I found online a .txt file containing all existing English words.
  2. My script starts by going through this text file and placing each of its entries in a map object, for easy access.

The problem with this approach is that when the program starts, step 2) takes approximately 20 seconds. This is not a big deal in itself, but I was wondering if any of you had an idea of another approach to make my word database quickly available. For instance, would there be a way to store the map object in a file, so that I don't need to build it from the text file every time?

If your file with all English words is not dynamic, you can just store it in a static map. To do so, you need to parse your .txt file, something like:

alpha

beta

gamma

...

to convert it to something like this:

static std::map<std::string, int> wordDictionary = {
    { "alpha", 0 },
    { "beta", 0 },
    { "gamma", 0 },
    ...
};

You can do that programmatically or simply with find-and-replace in your favourite text editor.

Your .exe is going to be much bigger than before, but it will also start much faster than reading this info from a file.

I'm a little bit surprised that nobody has come up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that it takes too long to read in your list of words (and put them into a data structure that hopefully provides fast look-up operations) whenever you use your application. Building up such a structure once, then saving it into a binary file for later reuse, should improve the performance of your application (based on the results presented below).

Here is a piece of code (and a minimal working example at the same time) that might help you out with this.

#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp> 

#include "prettyprint.hpp"

class Dictionary {
public:
  Dictionary() = default;
  Dictionary(std::string const& file_)
    : _file(file_)
  {}

  inline size_t size() const { return _words.size(); }

  void build_wordset()
  {
    if (!_file.size()) { throw std::runtime_error("No file to read!"); }

    std::ifstream infile(_file);
    std::string line;

    while (std::getline(infile, line)) {
      _words.insert(line);
    }
  }

  friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
  {
    os << d._words;  // cxx-prettyprint used here
    return os;
  }

  int save(std::string const& out_file) 
  { 
    std::ofstream ofs(out_file.c_str(), std::ios::binary);
    if (ofs.fail()) { return -1; }

    boost::archive::binary_oarchive oa(ofs); 
    oa << _words;
    return 0;
  }

  int load(std::string const& in_file)
  {
    _words.clear();

    std::ifstream ifs(in_file, std::ios::binary);  // binary mode, to match save()
    if (ifs.fail()) { return -1; }

    boost::archive::binary_iarchive ia(ifs);
    ia >> _words;
    return 0;
  }

private:
  friend class boost::serialization::access;

  template <typename Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & _words;
  }

private:
  std::string           _file;
  std::set<std::string> _words;
};

void create_new_dict()
{
  std::string const in_file("words.txt");
  std::string const ser_dict("words.set");

  Dictionary d(in_file);

  auto start = std::chrono::system_clock::now();
  d.build_wordset();
  auto end = std::chrono::system_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Building up the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;

  d.save(ser_dict);
}

void use_existing_dict()
{
  std::string const ser_dict("words.set");

  Dictionary d;

  auto start = std::chrono::system_clock::now();
  d.load(ser_dict);
  auto end = std::chrono::system_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Loading in the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;
}

int main()
{
  create_new_dict();
  use_existing_dict();
  return 0;
}

Sorry for not putting the code into separate files and for the poor design; however, for demonstration purposes it should be just enough.

Note that I didn't use a map: I just don't see the point of unnecessarily storing a lot of zeros or anything else. AFAIK, a std::set is backed by the same powerful red-black tree as std::map.

For the data set available here (it contains around 466k words), I got the following results:

Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544

Dependencies: Boost (for the serialization) and cxx-prettyprint (for the `prettyprint.hpp` include used by `operator<<`).

Hope this helps. :) Cheers.

First things first: do not use a map (or a set) for storing a word list. Use a vector of strings, make sure its contents are sorted (I would expect your word list is already sorted), and then use std::binary_search from the <algorithm> header to check whether a word is in the dictionary.

Although this may still be highly suboptimal (depending on whether your compiler does small string optimisation), your load times should improve by at least an order of magnitude. Do a benchmark, and if you want to make it faster still, post another question about the vector of strings.

