Searching through very large rainbow table file

I am looking for the best way to search through a very large (13 GB) rainbow table file. It is a CSV-style file, looking something like this:

1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring

... you get the idea - there are about 900 million lines, always with a hash, a semicolon, and a clear-text string.

So basically, the program should check whether a specific hash is listed in this file.

What's the fastest way to do this? Obviously, I can't read the entire file into memory and then run strstr() on it.

So what's the most efficient way to do this?

  1. read the file line by line, always doing a strstr();
  2. read a larger chunk of the file (e.g. 10,000 lines) and do a strstr() on that.

Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?

Any help is appreciated

The best way to do it would be to sort the file and then use a binary-search-like algorithm on it. After sorting, it will take around O(log n) time to find a particular entry, where n is the number of entries you have. Your algorithm might look like this:

  1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
  2. If start = end, there is no match.
  3. Read some data from the offset (start + end) / 2.
  4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size to read in step 3 (bigger than most of your records), you probably won't have to read any more.)
  5. Compare the hash you're on with the hash you're looking for:
    • If they are equal, go on to step 6.
    • If the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
    • If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
  6. Skip past the semicolon and the trailing space. The unhashed data runs from the current position to the next newline.

This can be easily converted into a while loop with breaks.
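For illustration, here is a minimal sketch of that loop in C. It assumes the file is already sorted by hash, that each line is a 32-digit lowercase hex hash followed by "; " and the string, and that CHUNK (a size picked here) is larger than any single record; for a 13 GB file you'd also want fseeko()/ftello() or another 64-bit-offset API.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define HASH_LEN 32      /* hex digits in one hash (assumed MD5-style) */
    #define CHUNK    4096    /* assumed bigger than any single record      */

    /* Binary search over the sorted text file; returns a malloc'd copy of
     * the plain text for `target`, or NULL if it isn't in the table.      */
    static char *lookup(FILE *f, const char *target)
    {
        long start = 0, end;
        char buf[CHUNK + 1];

        fseek(f, 0, SEEK_END);           /* use fseeko/ftello for >2 GiB   */
        end = ftell(f);

        while (start < end) {
            long mid = start + (end - start) / 2;

            fseek(f, mid, SEEK_SET);
            size_t n = fread(buf, 1, CHUNK, f);
            buf[n] = '\0';

            /* Step 4: skip forward to the start of the next full line.    */
            char *line = memchr(buf, '\n', n);
            if (!line) { end = mid; continue; }  /* landed in the tail      */
            line++;

            /* Step 5: compare the hash we landed on with the target.      */
            int cmp = strncmp(line, target, HASH_LEN);
            if (cmp == 0) {                      /* step 6: found           */
                char *text = line + HASH_LEN + 2;    /* skip "; "           */
                char *nl = strchr(text, '\n');
                if (nl) *nl = '\0';
                return strdup(text);
            }
            if (cmp < 0)
                start = mid + (long)(line - buf);    /* search upper half   */
            else
                end = mid;                           /* search lower half   */
        }
        return NULL;
    }

Note two simplifications in this sketch: a record straddling the end of the chunk isn't handled, and the very first line of the file is never landed on (skipping to a newline always passes it), so it needs a separate check before the loop.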

Importing it into MySQL with appropriate indices and such would use a similarly efficient algorithm (or an even more efficient one, since the data would probably be packed nicely).

Your last solution might be the easiest one to implement, as you move all the performance optimization into the database (and databases are usually optimized for exactly that).

strstr is not useful here, as it searches a whole string, but you know the specific format and can jump around and compare in a more goal-oriented way. Think about strncmp and strchr.

The overhead of reading a single line would be really high (as is often the case with file IO). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
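As a sketch of that chunked approach (the buffer size and helper name are chosen here, not taken from the question): since every line starts with the hash, one strncmp per line is enough, and strchr finds the line boundaries.

    #include <stdio.h>
    #include <string.h>

    #define HASH_LEN 32
    #define CHUNK    (1 << 20)   /* read 1 MiB at a time (assumed size) */

    /* Scan the whole file chunk by chunk; returns 1 if `target` is found. */
    static int scan_file(FILE *f, const char *target)
    {
        static char buf[CHUNK + 1];
        size_t keep = 0;          /* bytes of a partial line carried over */
        size_t n;

        while ((n = fread(buf + keep, 1, CHUNK - keep, f)) > 0) {
            size_t len = keep + n;
            buf[len] = '\0';

            char *line = buf, *nl;
            while ((nl = strchr(line, '\n')) != NULL) {
                /* One strncmp per line: each line starts with the hash. */
                if (strncmp(line, target, HASH_LEN) == 0)
                    return 1;
                line = nl + 1;
            }
            keep = len - (size_t)(line - buf);
            memmove(buf, line, keep);   /* move the tail to the front    */
        }
        return 0;   /* a last line without '\n' is not checked here      */
    }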

You can also think about using memory-mapped IO instead of the standard C file API. That way you can leave loading the contents to the operating system and don't have to take care of caching yourself.
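A minimal POSIX sketch of that idea (error handling mostly trimmed; mapping a 13 GB file requires a 64-bit process):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map the whole table into the address space; the kernel pages it in
     * on demand and handles all the caching.                              */
    static const char *map_table(const char *path, size_t *size_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }

        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                     /* the mapping stays valid          */
        if (p == MAP_FAILED) return NULL;

        *size_out = (size_t)st.st_size;
        return p;   /* the file now looks like one big byte array in RAM  */
    }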

Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, as you can easily calculate the position of the nth entry.
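For instance, with a hypothetical fixed record size of 64 bytes (hash, separator, padded string, newline), the nth record sits at a directly computable offset:

    #include <stdio.h>

    #define RECORD_SIZE 64   /* assumed: hash + "; " + padded string + '\n' */

    /* Fixed-size records make the sorted file randomly accessible, so a
     * binary search can seek straight to entry n without newline scanning. */
    static int read_record(FILE *f, long n, char rec[RECORD_SIZE + 1])
    {
        /* use fseeko() for files beyond 2 GiB on 32-bit systems */
        if (fseek(f, n * RECORD_SIZE, SEEK_SET) != 0) return -1;
        if (fread(rec, 1, RECORD_SIZE, f) != RECORD_SIZE) return -1;
        rec[RECORD_SIZE] = '\0';
        return 0;
    }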

I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file is 13 GiB, then each of the smaller files would be (on average) 208 KiB.
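A tiny sketch of picking the bucket file from the first four hex digits (the helper name is invented; the path scheme follows the description above):

    #include <stdio.h>

    /* "1f129c42..." -> "1f/12data.txt" */
    static void bucket_path(const char *hash, char out[32])
    {
        snprintf(out, 32, "%.2s/%.2sdata.txt", hash, hash + 2);
    }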

Next, separate the hashes from the strings, so that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.

Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as each string file is less than 256 KiB). Entries in a hash file should be sorted, and the corresponding strings file should be in the same order.
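As a sketch of that packing (helper names invented here): two hex digits go into each byte, which is where 12 digits become 48 bits.

    #include <stdint.h>

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* Pack 12 hex digits into 6 bytes: half the size of the text form. */
    static int hex_to_bytes(const char *hex12, uint8_t out[6])
    {
        for (int i = 0; i < 6; i++) {
            int hi = hexval(hex12[2 * i]);
            int lo = hexval(hex12[2 * i + 1]);
            if (hi < 0 || lo < 0) return -1;
            out[i] = (uint8_t)(hi << 4 | lo);
        }
        return 0;
    }

Packing the digits most-significant-first like this also keeps entries comparable with memcmp, so the sorted order of the hex strings carries over to the binary form.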

After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load that hash file, and do a binary search on it. Then (if found) you'd get the offset of the start of the string (in the strings file) from that entry in the hash file, plus the offset of the next string from the next entry in the hash file. Then you'd load the data from the strings file, starting at the start of the correct string and ending at the start of the next one.
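A sketch of the search within one loaded hash file, assuming the 8-byte entry layout described above (6 bytes of remaining hash followed by the 16-bit offset/4), with entries sorted so that the standard bsearch works:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define ENTRY_SIZE 8   /* 6 bytes of hash + 2 bytes of "offset / 4" */

    /* bsearch comparator: compare the 6-byte key with an entry's hash. */
    static int cmp_entry(const void *key, const void *entry)
    {
        return memcmp(key, entry, 6);
    }

    /* Find the entry for the low 48 bits of a hash in a loaded hash file;
     * the string then occupies [off * 4, next_off * 4) in the matching
     * strings file.                                                       */
    static const uint8_t *find_entry(const uint8_t key48[6],
                                     const uint8_t *file, size_t nbytes)
    {
        return bsearch(key48, file, nbytes / ENTRY_SIZE, ENTRY_SIZE,
                       cmp_entry);
    }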

Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, that'd be enough to cache half of the hash files. In that case (half the hash files cached), you'd expect that half the time the only thing you'd need to load from disk is the string itself (probably less than 20 bytes), and the other half of the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average, for each lookup, you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM, you can cache all of the hash files and start thinking about caching some of the strings.

A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For example, if all your strings use lowercase ASCII letters and are at most 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach: e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible, and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings), this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
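One possible encoding of that kind, sketched below: it uses base 27 rather than base 26 so that strings shorter than 13 characters also round-trip unambiguously (27^13 < 2^62, so the top bit stays free as the "not encodable" flag):

    #include <stdint.h>

    /* Encode 0..13 lowercase letters as a base-27 number ('a'..'z' map
     * to 1..26); returns -1 if the string can't be encoded this way.    */
    static int encode(const char *s, uint64_t *out)
    {
        uint64_t v = 0;
        int len = 0;
        for (; *s; s++, len++) {
            if (*s < 'a' || *s > 'z' || len >= 13) return -1;
            v = v * 27 + (uint64_t)(*s - 'a' + 1);
        }
        *out = v;
        return 0;
    }

    /* Decode back into the original string (out must hold 14 bytes). */
    static void decode(uint64_t v, char out[14])
    {
        char tmp[13];
        int n = 0;
        while (v) {                     /* digits come out in reverse   */
            tmp[n++] = (char)('a' + (int)(v % 27) - 1);
            v /= 27;
        }
        for (int i = 0; i < n; i++)
            out[i] = tmp[n - 1 - i];    /* reverse into the result      */
        out[n] = '\0';
    }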
