
How can I perform a search on a lookup table without loading it into memory?

Asked 2015-06-04 15:09:01. Tags: c++, dictionary, key-value, lookup-tables, key-value-store

I have a file that records the entries of a lookup table. If the number of entries is small, I can simply load the file into an STL map and search it in my code. But what if there are a great many entries? Loading them all this way may cause errors such as running out of memory. I would appreciate your advice.

PS: I just want to perform the search without loading all entries into memory.

Can a key-value database solve this problem?

2 Answers

You'll have to load the data from the hard drive eventually, but if the table is huge it won't fit into memory for a linear search over the whole thing, so:

1. Consider whether you can split the data into a set of files.
2. Make an index table that records which file contains which entries (say the first 100 entries are in "file1_100", the second hundred are in "file101_200", and so on).
3. Use the index table from step 2 to locate the file to load.
4. Load that file and do a linear search.
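The four steps above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: each chunk file is assumed to hold sorted "key<TAB>value" lines, the index is assumed to map the smallest key of each chunk to its file name, and the names `ChunkIndex`, `locate_chunk`, `search_chunk`, and `lookup` are made up for the example.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <map>
#include <string>

// Hypothetical layout: every chunk file holds "key<TAB>value" lines, and the
// index maps the smallest key stored in a chunk to that chunk's file name.
using ChunkIndex = std::map<std::string, std::string>;

// Step 3: pick the only chunk whose key range could contain `key`.
bool locate_chunk(const ChunkIndex& index, const std::string& key,
                  std::string& path) {
    auto it = index.upper_bound(key);        // first chunk that starts after key
    if (it == index.begin()) return false;   // key sorts before every chunk
    path = std::prev(it)->second;
    return true;
}

// Step 4: read that one chunk line by line and compare keys.
bool search_chunk(const std::string& path, const std::string& key,
                  std::string& value) {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type tab = line.find('\t');
        if (tab != std::string::npos && line.compare(0, tab, key) == 0) {
            value = line.substr(tab + 1);
            return true;
        }
    }
    return false;
}

// Steps 3 and 4 combined: only the small index and one chunk are ever read.
bool lookup(const ChunkIndex& index, const std::string& key,
            std::string& value) {
    std::string path;
    return locate_chunk(index, key, path) && search_chunk(path, key, value);
}
```

Only the index has to stay in memory, so its size grows with the number of chunk files rather than with the number of entries.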

That is a greatly simplified version of what a typical database management system does, so you may want to use one such as MySQL, PostgreSQL, MS SQL Server, Oracle, or any other. If this is a study project, then once the search works, consider optimizing the linear operations (by switching to something like binary search) and the tables (real databases use balanced tree structures, hash tables, and the like).
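As one possible illustration of the binary search suggestion, the search can run directly on the file if the records are sorted and all have the same size, because the position of record i is then simply i times the record size. The fixed-width layout below (16-byte key, 16-byte value, space padded) and the names `pad`, `trim_right`, and `binary_search_file` are assumptions made for this sketch, not something the question specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical record layout: 16-byte key followed by 16-byte value, both
// padded with trailing spaces; records are sorted by key in byte order.
const std::size_t kKeySize = 16;
const std::size_t kValueSize = 16;
const std::size_t kRecordSize = kKeySize + kValueSize;

// Pad (or truncate) a string to exactly `width` characters.
std::string pad(const std::string& s, std::size_t width) {
    std::string out = s.substr(0, width);
    out.resize(width, ' ');
    return out;
}

// Remove the trailing padding spaces.
std::string trim_right(std::string s) {
    s.erase(s.find_last_not_of(' ') + 1);
    return s;
}

// Binary search over the file itself: each probe seeks to one record and
// reads kRecordSize bytes, so memory use does not depend on the file size.
bool binary_search_file(const std::string& path, const std::string& key,
                        std::string& value) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.seekg(0, std::ios::end);
    const std::size_t count = static_cast<std::size_t>(in.tellg()) / kRecordSize;
    const std::string target = pad(key, kKeySize);
    std::string record(kRecordSize, '\0');
    std::size_t lo = 0, hi = count;          // candidates are in [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        in.seekg(static_cast<std::streamoff>(mid * kRecordSize), std::ios::beg);
        if (!in.read(&record[0], static_cast<std::streamsize>(kRecordSize)))
            return false;
        const int cmp = record.compare(0, kKeySize, target);
        if (cmp == 0) {
            value = trim_right(record.substr(kKeySize));
            return true;
        }
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
}
```

Each lookup touches about log2(N) records, so a file with a billion entries needs roughly 30 seeks. Keys longer than 16 bytes are truncated by `pad` in this sketch, so a real layout would have to choose the widths to fit the data.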

One method would be to reorganize the data in the file into groups.

For example, let's consider a full language dictionary. Usually, dictionaries are too large to read completely into memory. So one idea is to group the words by first letter.

In this example, you would first read in the appropriate group based on the first letter. So if the word you are searching for begins with "m", you would load only the "m" group into memory.

There are other ways of grouping, such as by word (key) length. There can also be subgroups. In this example, you could divide the "m" group further by word length or by second letter.

After grouping, you may want to write the data back to another file so that you don't have to regroup it every time.
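A minimal sketch of this grouping idea, assuming the source file holds "key<TAB>value" lines whose keys start with a letter, and assuming one output file per first letter. The file naming scheme and the names `group_path`, `split_by_first_letter`, and `lookup_grouped` are invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <fstream>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical naming scheme: the group for 'm' lives in "<prefix>m.txt".
std::string group_path(const std::string& prefix, char letter) {
    return prefix + letter + ".txt";
}

char first_letter(const std::string& s) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(s[0])));
}

// One-time preprocessing: stream the big file line by line and append each
// entry to the file of its first letter. Only one line is held at a time.
void split_by_first_letter(const std::string& source, const std::string& prefix) {
    std::ifstream in(source);
    std::map<char, std::ofstream> groups;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const char letter = first_letter(line);
        auto it = groups.find(letter);
        if (it == groups.end())
            it = groups.emplace(letter, std::ofstream(group_path(prefix, letter))).first;
        it->second << line << '\n';
    }
}

// Lookup: load only the group that can contain `key`, then search it in memory.
bool lookup_grouped(const std::string& prefix, const std::string& key,
                    std::string& value) {
    if (key.empty()) return false;
    std::ifstream in(group_path(prefix, first_letter(key)));
    std::unordered_map<std::string, std::string> group;
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type tab = line.find('\t');
        if (tab != std::string::npos)
            group[line.substr(0, tab)] = line.substr(tab + 1);
    }
    const auto it = group.find(key);
    if (it == group.end()) return false;
    value = it->second;
    return true;
}
```

If a single letter's group is still too large for memory, the same split can be applied again inside that group, for example on the second letter.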

There are many ways to store groups in the file, such as using a "section" marker. Those would be a topic for another question, though.
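For what it's worth, one way a "section" marker might look is sketched below. The format is purely an assumption for illustration: a line such as "[m]" opens the section for keys starting with 'm', the entries are "key<TAB>value" lines with LF line endings, and no key starts with '['. The names `SectionOffsets`, `index_sections`, and `lookup_in_section` are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <map>
#include <string>

// Hypothetical file format: a line such as "[m]" opens the section for the
// letter 'm'; the "key<TAB>value" lines that follow belong to that section.
using SectionOffsets = std::map<char, std::streamoff>;

// One pass over the file records where the body of each section starts.
// Only the offsets are kept in memory, not the entries themselves.
SectionOffsets index_sections(const std::string& path) {
    SectionOffsets offsets;
    std::ifstream in(path, std::ios::binary);
    std::string line;
    while (std::getline(in, line)) {
        if (line.size() == 3 && line[0] == '[' && line[2] == ']') {
            const std::streamoff pos = in.tellg();   // just after the marker line
            if (pos >= 0) offsets[line[1]] = pos;
        }
    }
    return offsets;
}

// Jump straight to the section for the key's first character and scan it
// until the next marker line or the end of the file.
bool lookup_in_section(const std::string& path, const SectionOffsets& offsets,
                       const std::string& key, std::string& value) {
    if (key.empty()) return false;
    const auto it = offsets.find(key[0]);
    if (it == offsets.end()) return false;
    std::ifstream in(path, std::ios::binary);
    in.seekg(it->second, std::ios::beg);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '[') break;   // next section begins
        const std::string::size_type tab = line.find('\t');
        if (tab != std::string::npos && line.compare(0, tab, key) == 0) {
            value = line.substr(tab + 1);
            return true;
        }
    }
    return false;
}
```

Compared with one file per group, this keeps everything in a single file at the cost of one indexing pass, and the offsets could themselves be saved so that the pass only happens once.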

The ideas here, including those from @047, are about structuring the data for the most efficient search given your memory constraints.