C++ Search Algorithm: Handling Large Amounts of Data

Question

I have code that searches files for a string. The files can be 1 MB, 1 GB, or larger.

I read the file data with the ReadFile() WinAPI function and convert it to hex, then search the converted data for a string (which has been converted to hex beforehand).

I used this code for the search (a string search):

std::string searchStr = "48656C6C6FA";
std::string fileData = ToHex(inputString);

if(fileData.find(searchStr, 0) != std::string::npos)
{
    std::cout << FileName;
}

It takes almost 11 seconds to search for the string in 2900 files.

Is there another search algorithm or function that would be faster? The approach above also sometimes misses the string, so it does not work perfectly.

Answer 1

If you have a smaller file (a few megabytes, or even a couple of hundred megabytes, depending on how much memory your system has), read it all into memory; otherwise I recommend using memory-mapped files. If the file is too big to be mapped, you can use a sliding-window or double-buffering algorithm to read blocks of data from the file into memory.
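The sliding-window idea can be sketched in portable C++ as follows. This is a minimal illustration rather than a tuned implementation: the function name `file_contains` and the default 1 MiB chunk size are assumptions made for the example, and it uses `std::ifstream` instead of ReadFile() so that it is not tied to Windows. The important detail is the overlap: the last `pattern.size() - 1` bytes of each window are carried into the next one, so a match that straddles a chunk boundary is still found.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `pattern` occurs anywhere in the file.
// The file is read in chunks of `chunk_size` bytes; the tail of each window
// is kept so that matches spanning two chunks are not missed.
bool file_contains(const std::string& path, const std::string& pattern,
                   std::size_t chunk_size = 1 << 20)
{
    assert(!pattern.empty() && chunk_size > 0);

    std::ifstream in(path, std::ios::binary);
    if (!in) return false;  // treat an unreadable file as "no match"

    const std::size_t overlap = pattern.size() - 1;
    std::vector<char> buffer(overlap + chunk_size);
    std::size_t carried = 0;  // bytes kept from the previous window

    for (;;) {
        in.read(buffer.data() + carried,
                static_cast<std::streamsize>(chunk_size));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) return false;  // end of file reached without a match

        const std::size_t valid = carried + got;
        const auto first = buffer.begin();
        const auto last = first + static_cast<std::ptrdiff_t>(valid);
        if (std::search(first, last, pattern.begin(), pattern.end()) != last)
            return true;

        // Move the tail of this window to the front for the next read.
        carried = std::min(overlap, valid);
        std::memmove(buffer.data(), buffer.data() + (valid - carried), carried);
    }
}
```

Memory use stays bounded by the chunk size plus the pattern length, regardless of how large the file is.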

Then, to search for a specific sequence of bytes, do a linear search through the contents of the file, looking for the first byte of the sequence (for the hex string 48656C6C6FA read left to right, that's 0x48). If it is found, try to match the second byte in the sequence (in the example, 0x65) against the next byte from the file, and so on until you have matched the whole sequence.

If the second (or any later) byte doesn't match, continue searching for the first byte.
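The two steps described above might look like this when written out by hand. It is only a sketch: `find_bytes` is a name invented for the example, and returning `data_len` to mean "not found" is an arbitrary convention chosen here. `std::memchr` does the scan for the first byte and `std::memcmp` checks the rest of the sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the offset of the first occurrence of
// `pattern` in `data`, or `data_len` if the pattern does not occur.
std::size_t find_bytes(const unsigned char* data, std::size_t data_len,
                       const unsigned char* pattern, std::size_t pattern_len)
{
    assert(pattern_len > 0);
    if (pattern_len > data_len) return data_len;

    // A match cannot start later than this offset.
    const std::size_t last_start = data_len - pattern_len;

    std::size_t pos = 0;
    while (pos <= last_start) {
        // Step 1: scan forward for the first byte of the pattern.
        const void* hit =
            std::memchr(data + pos, pattern[0], last_start - pos + 1);
        if (hit == nullptr) return data_len;
        pos = static_cast<std::size_t>(
            static_cast<const unsigned char*>(hit) - data);

        // Step 2: compare the whole pattern at this position.
        if (std::memcmp(data + pos, pattern, pattern_len) == 0) return pos;

        // Mismatch: resume the scan one byte further on.
        ++pos;
    }
    return data_len;
}
```

In practice `std::search` does the same job; the explicit version is shown only to make the steps visible.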

This has O(n) complexity, where n is the number of bytes in the file. Unless you know beforehand that the data you are searching for is in a specific part of the file, that's the best you're going to get.

If the files are on an SSD, you can use threads to search, one thread per file. But don't start all 2900 at once; that will swamp the processor. Instead, have 4-8 threads doing the search (depending on the number of cores in your system), and as soon as a thread finishes a file, it takes the next one.

This can't be used on a spinning-disk drive, because the disk will thrash as the heads seek back and forth while the threads try to read.
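One possible shape for such a small worker pool is sketched below. Everything here is illustrative: `search_files` and its `matches` callback are hypothetical names, the callback stands in for whatever per-file search you use, and the fallback of 4 threads is an arbitrary choice for when `std::thread::hardware_concurrency()` reports 0. Each worker claims the next unprocessed file through a shared atomic counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: calls `matches(path)` for every path using a fixed
// number of worker threads and returns, in input order, the paths for which
// it returned true. `matches` must be safe to call concurrently and should
// not throw (an exception escaping a thread terminates the program).
std::vector<std::string> search_files(
    const std::vector<std::string>& paths,
    const std::function<bool(const std::string&)>& matches,
    unsigned thread_count = std::thread::hardware_concurrency())
{
    assert(static_cast<bool>(matches) && "matches must be callable");
    if (thread_count == 0) thread_count = 4;  // hardware_concurrency() may be 0
    thread_count = static_cast<unsigned>(
        std::min<std::size_t>(thread_count, paths.size()));

    std::atomic<std::size_t> next{0};        // index of the next unclaimed file
    std::vector<char> hit(paths.size(), 0);  // one slot per file, no locking

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= paths.size()) return;   // nothing left to do
            hit[i] = matches(paths[i]) ? 1 : 0;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < thread_count; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::vector<std::string> found;
    for (std::size_t i = 0; i < paths.size(); ++i)
        if (hit[i]) found.push_back(paths[i]);
    return found;
}
```

Because every worker writes only to its own slot of `hit`, no mutex is needed, and the result order does not depend on thread scheduling.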

Answer 2

Speed: use a memory-mapped file.

Accuracy: use std::search on binary values.

For example:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// some function to return a pointer to the first byte in the file and the length 
extern std::tuple<const std::uint8_t*, std::size_t> get_file_bounds();

int main()
{
    auto [begin, size] = get_file_bounds();
    auto search_string = std::vector<std::uint8_t> {
        0x48,
        0x65,
        0x6C,
        0x6C,
        0x6F
    };

    auto iter = std::search(begin, begin + size, 
                            search_string.begin(), search_string.end());

    if (iter != begin + size)
    {
        // found the sequence 
    }
    else 
    {
        // didn't find it
    }

}

Answer 3

For search strings as short as yours (apparently 5 1/2 bytes), the bottleneck will often be disk I/O. I suspect those 2900 files are on a hard disk. That would translate to roughly 4 ms per file, which is quite decent.

Sure, the conversion to hex may be a bit clumsy, but given the 5 1/2 bytes (11 hex digits) it might not be entirely unreasonable. That is, you might not get a major speed improvement if the HDD is the real bottleneck.

So, to check, measure how much time you spend if you don't search the 2900 files at all and just read them in. Don't even convert them to hex. No matter how smart the search algorithm, the time you need for disk I/O is a lower bound. If that isn't good enough, get a fast SSD.
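A rough way to take that measurement is sketched below. The names `ReadStats` and `time_raw_reads` are made up for this example, and `std::ifstream` is used in place of ReadFile() to keep the sketch portable. It reads every file in fixed-size chunks, discards the data, and reports the byte count and elapsed wall-clock time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical result type for the measurement.
struct ReadStats {
    std::size_t bytes = 0;                     // total bytes read
    std::size_t unreadable = 0;                // files that could not be opened
    std::chrono::duration<double> elapsed{0};  // wall-clock time in seconds
};

// Reads every file without searching or converting anything, so the
// elapsed time approximates the pure I/O cost.
ReadStats time_raw_reads(const std::vector<std::string>& paths,
                         std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    ReadStats stats;
    std::vector<char> buffer(chunk_size);

    const auto start = std::chrono::steady_clock::now();
    for (const auto& path : paths) {
        std::ifstream in(path, std::ios::binary);
        if (!in) { ++stats.unreadable; continue; }
        // read() fails on the final short chunk, so also check gcount().
        while (in.read(buffer.data(),
                       static_cast<std::streamsize>(buffer.size())) ||
               in.gcount() > 0) {
            stats.bytes += static_cast<std::size_t>(in.gcount());
        }
    }
    stats.elapsed = std::chrono::steady_clock::now() - start;
    return stats;
}
```

Keep in mind that the operating system caches recently read files, so a second run over the same files is usually much faster than the first and says little about the disk itself.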

Answer 4

For a faster string search algorithm, take a look at the Boyer-Moore search algorithm. Boost (and C++17) has an implementation of it.

Also, avoid converting the file to hex (a std::string can contain '\0' characters).
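Both points can be combined in a few lines, assuming a C++17 compiler. The sketch below keeps the raw bytes in a `std::string` (embedded '\0' bytes included) and passes a `std::boyer_moore_searcher` from `<functional>` to `std::search`. The wrapper name `find_raw` is invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical wrapper: returns the offset of `pattern` in `data`, or
// std::string::npos if it does not occur. Both strings hold raw bytes and
// may contain '\0'; their size() is used, not a terminator.
std::size_t find_raw(const std::string& data, const std::string& pattern)
{
    assert(!pattern.empty());
    const auto it = std::search(
        data.begin(), data.end(),
        std::boyer_moore_searcher(pattern.begin(), pattern.end()));
    return it == data.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - data.begin());
}
```

When building such strings from a buffer, use the pointer-and-length constructor, for example `std::string(buffer, bytesRead)`; constructing from a bare `const char*` stops at the first '\0'. If many files are searched for the same pattern, constructing the searcher once and reusing it avoids repeating the preprocessing.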

And if your file I/O is the limiting factor, memory-mapped files might be the way forward.

Answer 5

While this is probably a storage bottleneck problem, there are string search algorithms that can be significantly faster than linear, for example Boyer-Moore (described at https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm ). They do require preprocessing the search pattern and have some memory overhead compared to a linear search.

The basic idea is to know how many characters can be skipped based on what you find at a given index (i.e. start at fileData[patternLen-1], and if that character isn't in the search pattern at all, you can next look at fileData[patternLen+patternLen-1], and so on).
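To make the skipping idea concrete, here is a sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only a "bad character" skip table. It is not the full Boyer-Moore algorithm, and `horspool_find` is a name chosen for this example. The table records, for each possible byte value, how far the search window may move when that byte is the last one in the current window.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns the offset of the first occurrence of
// `pattern` in `data`, or data.size() if there is none.
std::size_t horspool_find(const std::vector<unsigned char>& data,
                          const std::vector<unsigned char>& pattern)
{
    assert(!pattern.empty());
    const std::size_t n = data.size();
    const std::size_t m = pattern.size();
    if (m > n) return n;

    // skip[b] = distance to shift when the window's last byte is b.
    // Bytes that do not occur in the pattern (ignoring its final byte)
    // allow a shift by the full pattern length.
    std::array<std::size_t, 256> skip;
    skip.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[pattern[i]] = m - 1 - i;

    std::size_t pos = 0;
    while (pos <= n - m) {
        if (std::memcmp(data.data() + pos, pattern.data(), m) == 0)
            return pos;
        // Shift according to the byte under the window's last position.
        pos += skip[data[pos + m - 1]];
    }
    return n;
}
```

With a 5-byte pattern and data that rarely contains the pattern's bytes, most windows are rejected after looking at the last byte's table entry and the search advances 5 bytes at a time.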

The longer your pattern, the more likely such an algorithm is to be an improvement over a straight linear search. The Boost library already has implementations of several such improved string search algorithms (found in boost/algorithm/searching/).

5 answers

Answer 1
5 Accepted 2017-10-18 12:12:09

Answer 2
3 2017-10-18 12:21:05

Answer 3
1 2017-10-18 12:14:00

Answer 4
1 2017-10-18 12:31:00

Answer 5
1 2017-10-29 18:35:28
