
Finding pattern in large binary file using C or C++?

Original post: 2011-02-18 22:35:35 · Tags: c++, c, file, search, design-patterns

I have a ~700 MB binary file (non-text data). I would like to search it for a specific pattern of bytes that occurs at random locations throughout the file, e.g. 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 and so on for 50 or so bytes in sequence. The pattern is a repeating group of two arbitrary bytes followed by 0x55, so 0x55 occurs at every third byte.

That is, I want to find tables stored in the file that use 0x55 as the delimiter, and then save the data contained in the tables or otherwise manipulate it.

Would the best option be to simply go through the file one byte at a time, look ahead two bytes to see whether the value is 0x55, and, if it is, keep looking ahead to confirm that a table exists at that location?
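For concreteness, the byte-at-a-time look-ahead described above could look roughly like this. It is only a sketch: it assumes the data is already in memory, and `TableHit`, `find_tables_naive` and the `min_records` threshold are made-up names for illustration, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A candidate table: where it starts and how many
// "two bytes + delimiter" records follow each other there.
struct TableHit {
    std::size_t offset;
    std::size_t records;
};

// Walk the buffer one byte at a time. At each position, look ahead in
// steps of three and count how many consecutive records end in `delim`.
std::vector<TableHit> find_tables_naive(const std::vector<std::uint8_t>& data,
                                        std::size_t min_records,
                                        std::uint8_t delim = 0x55) {
    assert(min_records > 0);
    std::vector<TableHit> hits;
    std::size_t i = 0;
    while (i + 2 < data.size()) {
        std::size_t records = 0;
        // data[i + 3 * records + 2] is the delimiter slot of the next record.
        while (i + 3 * records + 2 < data.size() &&
               data[i + 3 * records + 2] == delim) {
            ++records;
        }
        if (records >= min_records) {
            hits.push_back({i, records});
            i += 3 * records;  // skip past the table just reported
        } else {
            ++i;               // no table here, try the next byte
        }
    }
    return hits;
}
```

With `min_records` set to something like 16 (about 50 bytes), short accidental runs are ignored. The inner loop re-reads bytes on failed attempts, so this is simple rather than fast.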

Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?

What would be the best way of looking through this large file and finding the pattern, using C or C++?

3 answers

This sounds like a great job for a regular expression matcher or a deterministic finite automaton. These are high-power tools designed to do just what you're asking, and if you have them at your disposal you shouldn't have much trouble doing this sort of search. In C++, consider looking into the Boost.Regex library, which should have all the functionality you need to knock this problem down.
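If pulling in a regex library is not an option, the automaton idea can also be hand-rolled. The sketch below is not Boost.Regex and not a textbook DFA: it is a small state machine that keeps one run counter per alignment (position modulo 3) and can be fed the file chunk by chunk. `DelimiterScanner`, `TableHit` and `min_records` are names invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TableHit {
    std::size_t offset;   // where the first record of the run starts
    std::size_t records;  // consecutive "two bytes + delimiter" records
};

// Single-pass scanner. Each byte is looked at exactly once, so the file
// can be streamed through feed() in chunks of any size.
class DelimiterScanner {
public:
    explicit DelimiterScanner(std::size_t min_records, std::uint8_t delim = 0x55)
        : min_records_(min_records), delim_(delim) {
        assert(min_records > 0);
    }

    void feed(const std::uint8_t* chunk, std::size_t len) {
        for (std::size_t k = 0; k < len; ++k, ++pos_) {
            const std::size_t phase = pos_ % 3;
            // A delimiter needs two bytes in front of it, hence pos_ >= 2.
            if (chunk[k] == delim_ && pos_ >= 2) {
                ++run_[phase];
            } else {
                flush(phase, pos_);  // the run in this alignment ends here
            }
        }
    }

    // Call once after the last chunk to report runs that reach end of file.
    void finish() {
        for (std::size_t phase = 0; phase < 3; ++phase) {
            // First position at or after the end that belongs to this phase.
            const std::size_t next_slot = pos_ + (phase + 3 - pos_ % 3) % 3;
            flush(phase, next_slot);
        }
    }

    const std::vector<TableHit>& hits() const { return hits_; }

private:
    // `slot` is the position where the next delimiter of this phase would
    // have been; the run's last delimiter sits three bytes before it.
    void flush(std::size_t phase, std::size_t slot) {
        const std::size_t r = run_[phase];
        if (r >= min_records_) {
            hits_.push_back({slot - 3 * r - 2, r});
        }
        run_[phase] = 0;
    }

    std::size_t min_records_;
    std::uint8_t delim_;
    std::array<std::size_t, 3> run_{};  // current run length per alignment
    std::size_t pos_ = 0;               // absolute offset of the next byte
    std::vector<TableHit> hits_;
};
```

Typical use would be to read the file in fixed-size blocks, call `feed()` on each, then `finish()`. Hits are reported in the order their runs end, and runs in different alignments may overlap, so some post-filtering may be needed.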

What ultimately worked for me was a hybrid between the Boyer-Moore-Horspool algorithm (suggested by Jerry Coffin) and my own algorithm based on the structure of the tables and the data being stored.

Basically, the BMH algorithm caught most of what I was looking for: the obvious stuff.
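For readers who have not met it, here is a minimal C++ sketch of Boyer-Moore-Horspool over raw bytes. It is not the code used here (that was written in PHP), and plain BMH looks for a fixed needle with no wildcards, so the sketch assumes some fixed byte sequence to search for. `horspool_find_all` is a name chosen for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the offset of every occurrence of `needle` in `hay`,
// overlapping occurrences included.
std::vector<std::size_t> horspool_find_all(const std::vector<std::uint8_t>& hay,
                                           const std::vector<std::uint8_t>& needle) {
    assert(!needle.empty());
    const std::size_t m = needle.size();
    std::vector<std::size_t> found;
    if (hay.size() < m) return found;

    // Bad-character table: how far the window may move when its last byte
    // is b. Bytes that do not occur in needle[0 .. m-2] allow a full shift.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[needle[i]] = m - 1 - i;
    }

    std::size_t pos = 0;
    while (pos + m <= hay.size()) {
        // Compare the window against the needle from right to left.
        std::size_t j = m;
        while (j > 0 && hay[pos + j - 1] == needle[j - 1]) {
            --j;
        }
        if (j == 0) found.push_back(pos);
        // Shift according to the last byte of the current window.
        pos += shift[hay[pos + m - 1]];
    }
    return found;
}
```

The longer the needle and the rarer its last byte, the larger the average shift, which is where the speed-up over a byte-by-byte scan comes from.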

But some tables did turn out to have odd formatting, and I had to implement a semi-intelligent search that would look at the data following each 0x55 and figure out whether it was likely to be good data or just random junk.

Oddly enough, I ended up implementing it in PHP rather than C++, and dumping the results right into a MySQL database for querying. The search process took only about 5 minutes or less, and the results were largely good. I did end up with a lot of junk data, but it caught everything that I needed it to, and (as far as I'm aware) did not leave any good data behind.

"Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?"

If you can load the whole thing into memory, you should probably use the memory mapping features provided by your platform. This way, the operating system can decide whether it should keep large portions of the file in physical memory (i.e. the system has lots of free RAM at the moment), or whether it should work only in smaller chunks.

Of course, this only works if you can fit the file into your working set.
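As a rough illustration of the memory-mapping route, here is a sketch that assumes a POSIX system (`open`, `fstat`, `mmap`, `munmap`); on Windows the equivalent would go through `CreateFileMapping` and `MapViewOfFile` instead. `MappedFile` and `count_byte` are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a whole file, mapped with POSIX mmap.
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) {
            std::perror("open");
            return;
        }
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const std::uint8_t*>(p);
                size_ = static_cast<std::size_t>(st.st_size);
            } else {
                std::perror("mmap");
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) {
            ::munmap(const_cast<std::uint8_t*>(data_), size_);
        }
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    const std::uint8_t* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const std::uint8_t* data_ = nullptr;
    std::size_t size_ = 0;
};

// Trivial scan over the mapped bytes; any search routine that takes a
// pointer and a length can be used the same way.
std::size_t count_byte(const std::uint8_t* data, std::size_t size,
                       std::uint8_t value) {
    assert(data != nullptr || size == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == value) ++n;
    }
    return n;
}
```

The search code then just sees one contiguous array of `size()` bytes, and the kernel pages the file in and out as needed. On a 32-bit process a 700 MB mapping may not fit into the available address space, in which case the file has to be mapped or read in smaller windows.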