
Fast in string search

I have a problem I am looking for some guidance on solving in the most efficient way. I have 200 million strings of data ranging in size from 3 characters to 70 characters. The strings consist of letters, numbers, and several special characters such as dashes and underscores. I need to be able to quickly search for the entire string or any substring within a string (minimum substring size is 3). Quickly is defined here as less than 1 second.

As my first cut at this I did the following:

  1. Created 38 index files. An index contains all the substrings that start with a particular letter. The first 4 MB contain 1 million hash buckets (the starts of the hash chains). The rest of the index contains the linked-list chains from the hash buckets. My hashing is very evenly distributed. The 1 million hash buckets are kept in RAM and mirrored to disk.

  2. When a string is added to the index it is broken down into its non-duplicate (within itself) substrings of length 3 to n (where n is the length of the string minus 1). So, for example, "apples" is stored in the "A" index as apple, appl, app (the substrings starting with other letters are stored in the "L" and "P" indexes); see the sketch after this list.
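
To make the scheme concrete, here is a minimal sketch of the decomposition step (the hash function below is just a placeholder standing in for whatever the daemon really uses; the 1 million buckets per index file are as described above):

    #include <cstdint>
    #include <set>
    #include <string>
    #include <vector>

    // Placeholder hash (FNV-1a) -- stands in for the daemon's real hash function.
    static uint32_t bucket_of(const std::string& key) {
        uint32_t h = 2166136261u;
        for (unsigned char c : key) { h ^= c; h *= 16777619u; }
        return h % 1000000u;                      // 1 million buckets per index file
    }

    // A posting: which index file (by first character) and which bucket a substring lands in.
    struct Posting { char index_file; uint32_t bucket; std::string sub; };

    // Break a key into its unique substrings of length 3..n-1 (n = key length).
    std::vector<Posting> decompose(const std::string& key) {
        std::set<std::string> seen;               // de-duplicate within the key itself
        std::vector<Posting> out;
        const size_t n = key.size();
        for (size_t len = 3; len + 1 <= n; ++len) {
            for (size_t pos = 0; pos + len <= n; ++pos) {
                std::string s = key.substr(pos, len);
                if (seen.insert(s).second)
                    out.push_back({s[0], bucket_of(s), s});
            }
        }
        return out;
    }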

The search/add server runs as a daemon (in C++) and works like a champ. Typical search times are less than 1/2 second.

The problem is on the front end of the process. I typically add 30,000 keys at a time. This part of the process takes forever. By way of benchmark, the load time into an empty index of 180,000 variable-length keys is approximately 3 1/2 hours.

This scheme works except for the very long load times.

Before I go nuts optimizing (or trying to), I'm wondering whether or not there is a better way to solve this problem. Front and back wildcard searches (i.e., string LIKE '%ppl%') in a DBMS are amazingly slow (on the order of hours in MySQL, for example) for datasets this large, so DBMS solutions would seem to be out of the question. I can't use full-text search because we are not dealing with normal words, but with strings that may or may not be composed of real words.

From your description, the loading of data takes all that time because you're dealing with I/O, mirroring the inflated strings to hard disk. This will definitely be a bottleneck, mainly depending on the way you read and write data to the disk.

A possible improvement in execution time may be achieved using mmap with some LRU policy. I'm quite sure the idea of replicating data is to make the search faster, but since you're using -- as it seems -- only one machine, your bottleneck will move from memory searching to I/O requests.
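
A minimal sketch of the mmap route on POSIX, assuming one mapping per index file (the path handling and error handling are simplified; this is not a drop-in replacement for your mirroring code):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>

    // Map an index file read/write and let the OS page it in and out, instead of
    // explicitly mirroring the in-RAM buckets to disk. MAP_SHARED makes stores
    // through the returned pointer go back to the file.
    uint8_t* map_index(const char* path, size_t& size_out) {
        int fd = open(path, O_RDWR);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        size_out = static_cast<size_t>(st.st_size);
        void* p = mmap(nullptr, size_out, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                               // the mapping stays valid after close
        if (p == MAP_FAILED) return nullptr;
        madvise(p, size_out, MADV_RANDOM);       // hash-chain access is random, not sequential
        return static_cast<uint8_t*>(p);
    }
    // Flush dirty pages when convenient with msync(addr, len, MS_ASYNC).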

Another solution, which you may not be interested in -- it's sickly funny and disturbing as well (: -- is to split the data among multiple machines. Considering the way you've structured the data, the implementation itself may take a bit of time, but it would be very straightforward. You'd have:

  • each machine gets responsible for a set of buckets, chosen using something close to hash_id(bucket) % num_machines (see the sketch after this list);
  • insertions are performed locally, from each machine;
  • searches may either be interfaced by some layer in your query application, or simply clustered into sets of queries -- if the application is not interactive;
  • searches may even have the interface distributed, considering you may start a request from one node and forward requests to another node (also clustering requests, to avoid excessive I/O overhead).
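
A sketch of the routing rule, assuming every node knows the full peer list (the Node struct and the bucket id are placeholders for whatever you already have):

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Node { std::string host; uint16_t port; };   // hypothetical cluster description

    // hash_id(bucket) % num_machines: the node that owns a bucket, and therefore
    // receives both the insertions and the (sub)queries for that bucket.
    const Node& node_for_bucket(uint32_t bucket_id, const std::vector<Node>& nodes) {
        return nodes[bucket_id % nodes.size()];
    }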

Another good point is that, as you said, data is evenly distributed -- ALREADY \o/; this is usually one of the pickiest parts of a distributed implementation. Besides, this would be highly scalable, as you may add another machine whenever the data grows in size.

Instead of doing everything in one pass, solve the problem in 38 passes.

Read each of the 180,000 strings. Find the "A"s in each string, and write out stuff only to the "A" hash table. After you are done, write the entire finished result of the "A" hash table out to disk. (Have enough RAM to store the entire "A" hash table in memory -- if you don't, make smaller hash tables. I.e., key hash tables off pairs of starting letters, giving 38^2 = 1444 different tables. You could even dynamically change how many letters the hash tables are keyed off of, based on how common a prefix is, so they are all of modest size. Keeping track of how long such prefixes are isn't expensive.)

Then read each of the 180,000 strings, looking for "B". Etc.
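
A rough sketch of one such pass, assuming the 180,000 keys fit in memory and using a throwaway text format for the output (per-key de-duplication is omitted for brevity):

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // One pass per starting character: only substrings beginning with `letter`
    // are hashed, so only that one table has to fit -- and stay hot -- in RAM.
    void load_pass(const std::vector<std::string>& keys, char letter,
                   const std::string& out_path) {
        std::unordered_map<std::string, std::vector<uint32_t>> table;  // substring -> key ids
        for (uint32_t id = 0; id < keys.size(); ++id) {
            const std::string& k = keys[id];
            for (size_t pos = 0; pos < k.size(); ++pos) {
                if (k[pos] != letter) continue;            // this pass's letter only
                for (size_t len = 3; pos + len <= k.size(); ++len)
                    table[k.substr(pos, len)].push_back(id);
            }
        }
        // Write the finished table for this letter out to disk in one go.
        std::ofstream out(out_path);
        for (const auto& [sub, ids] : table) {
            out << sub;
            for (uint32_t id : ids) out << ' ' << id;
            out << '\n';
        }
    }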

My theory is that you are going slower than you could because of thrashing of your cache of your massive tables.

The next thing that might help is to limit the length of the strings you hash, in order to shrink the size of your tables.

Instead of doing all 2346 substrings of length 3 to 70 of a string of length 70, if you limited the length of the hash to 10 characters there would be only 516 substrings of length 3 to 10. And there may not be that many collisions on strings of length longer than 10. You could, again, make the length of the hashes dynamic -- the length-X hash might have a flag saying "try a length X+Y hash if your string is longer than X, this one is too common", and otherwise simply terminate the hashing. That could reduce the amount of data in your tables, at the cost of slower lookup in some cases.
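
For reference, a string of length n has n - k + 1 substrings of each length k, which is where those counts come from; a tiny check:

    #include <cstddef>
    #include <iostream>

    // Number of substrings of length 3..max_len in a string of length n.
    size_t substring_count(size_t n, size_t max_len) {
        size_t total = 0;
        for (size_t k = 3; k <= max_len && k <= n; ++k)
            total += n - k + 1;
        return total;
    }

    int main() {
        std::cout << substring_count(70, 70) << "\n";  // all lengths 3..70 -> 2346
        std::cout << substring_count(70, 10) << "\n";  // capped at 10      -> 516
    }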
