Searching for text in binary data

Question

I have binary data that contains text. The text is known. What would be a fast way to search for that text?

For example:

This is text 1---
!@##$%%#^%&!%^$! <= Assume this line is 3 MB of binary data
Now, This is text 2 ---
!@##$%%#^%&!%^$! <= Assume this line is 2.5 MB of binary data
This is text 3 ---

How can I search for the text "This is text 2"?

Currently I'm doing this:

size_t count = 0;
size_t s_len = strlen("This is text 2");

/* Assume data (char *) points to the start of the data to search and data_len is its length. */
/* Stop while a full match still fits, so memcmp never reads past the end of the data. */
for (; count + s_len <= data_len; ++count)
{
    if (!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}

Is there another, more efficient way to do this?
Would replacing the ++count logic with memchr('T') logic help? (Please ignore this if it is not clear.)
What is the average-case big-O complexity of memchr?

Answer 1

There are algorithms that do exactly this with better complexity than repeated memcmp, which is implemented the obvious way and has the obvious complexity for near matches: in the worst case every position costs up to s_len byte comparisons, so the whole scan is O(data_len * s_len).

Two famous algorithms are Boyer-Moore and Knuth-Morris-Pratt. These are only two examples; the general category they fall into is "string matching".
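To make that concrete, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore (not the full algorithm). The function name horspool_find and its signature are my own for illustration; they are not from any library. It works on raw bytes, so embedded NUL bytes in the data are fine.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: Boyer-Moore-Horspool search over raw bytes.
 * Returns a pointer to the first occurrence of pat in data, or NULL. */
static const unsigned char *horspool_find(const unsigned char *data, size_t data_len,
                                          const unsigned char *pat, size_t pat_len)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (pat_len == 0)
        return data;               /* empty pattern matches at the start */
    if (pat_len > data_len)
        return NULL;

    /* A byte that does not occur in the pattern lets us skip pat_len bytes. */
    for (i = 0; i <= UCHAR_MAX; ++i)
        shift[i] = pat_len;
    /* Bytes that occur in the pattern (ignoring its last byte) allow a smaller skip. */
    for (i = 0; i + 1 < pat_len; ++i)
        shift[pat[i]] = pat_len - 1 - i;

    pos = 0;
    while (pos <= data_len - pat_len) {
        if (memcmp(data + pos, pat, pat_len) == 0)
            return data + pos;
        /* Skip according to the data byte aligned with the pattern's last byte. */
        pos += shift[data[pos + pat_len - 1]];
    }
    return NULL;
}
```

With the question's variables, a call would look something like horspool_find((const unsigned char *)data, data_len, (const unsigned char *)"This is text 2", 14). On data that is mostly random bytes, most alignments are skipped by the full pattern length, which is where the speedup over a byte-by-byte scan comes from; the worst case is still O(data_len * pat_len).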

Answer 2

There's nothing in standard C to help you, but there is a GNU extension, memmem(), that does this:

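#define _GNU_SOURCE   /* memmem() is a GNU extension; define this before any #include */
#include <string.h>

#define TEXT2 "This is text 2"

/* sizeof(TEXT2) - 1 leaves out the terminating NUL, which is not part of the text being searched for. */
char *pos = memmem(data, data_len, TEXT2, sizeof(TEXT2) - 1);

if (pos != NULL)
{
    /* Found it: pos points at the first byte of the match. */
}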

If you need to be portable to systems that don't have this, you could take the glibc implementation of memmem() and incorporate it into your program.
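If pulling in the glibc source is not an option, a small fallback can be built from standard memchr and memcmp. The sketch below is an assumption on my part, not the glibc code: the name my_memmem is made up, and it has a worse worst case than a real string-matching algorithm. It also shows the memchr idea from the question: jump to each occurrence of the first byte, then compare.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback with the same argument order as memmem().
 * Returns a pointer to the first match, or NULL if there is none. */
static void *my_memmem(const void *haystack, size_t hay_len,
                       const void *needle, size_t needle_len)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;

    if (needle_len == 0)
        return (void *)h;
    while (hay_len >= needle_len) {
        /* Look for the first byte only where a full match could still start. */
        const unsigned char *p = memchr(h, n[0], hay_len - needle_len + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, n, needle_len) == 0)
            return (void *)p;
        /* False start: resume one byte past the candidate. */
        hay_len -= (size_t)(p - h) + 1;
        h = p + 1;
    }
    return NULL;
}
```

memchr itself is linear in the number of bytes it scans, but library versions are usually heavily optimized, so this tends to beat a hand-written ++count loop when the first byte of the text is rare in the data.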

Answer 3

I know the question is about the C programming language, but have you tried the Unix strings tool (http://en.wikipedia.org/wiki/Strings_(Unix)) together with grep?

$ strings datafile | grep "your text"

EDIT:

If you want to use C, I suggest this simple optimization:

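/* Requires <ctype.h> for isprint(). */
size_t count = 0;
size_t s_len = strlen("This is text 2");

for (; count + s_len <= data_len; ++count)
{
    /* Skip non-printable bytes; the cast avoids undefined behavior for negative char values. */
    if (!isprint((unsigned char)data[count])) continue;

    if (!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}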

If you want better performance, I suggest you look up and use a string matching algorithm.
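As one example of such an algorithm, here is a rough sketch of Knuth-Morris-Pratt over raw bytes. The function name kmp_find and the choice to return an offset (with (size_t)-1 meaning "not found") are assumptions made for illustration only. Unlike the loop above, it never re-reads a data byte, so it runs in O(data_len + pat_len) even in the worst case.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: Knuth-Morris-Pratt search over raw bytes.
 * Returns the offset of the first match, or (size_t)-1 if there is none
 * (or if the failure table cannot be allocated). */
static size_t kmp_find(const unsigned char *data, size_t data_len,
                       const unsigned char *pat, size_t pat_len)
{
    size_t *fail;
    size_t i, k, result = (size_t)-1;

    if (pat_len == 0)
        return 0;
    if (pat_len > data_len)
        return result;
    fail = malloc(pat_len * sizeof *fail);
    if (fail == NULL)
        return result;

    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of pat[0..i]. */
    fail[0] = 0;
    for (i = 1, k = 0; i < pat_len; ++i) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            ++k;
        fail[i] = k;
    }

    /* Scan the data once; k is the number of pattern bytes matched so far. */
    for (i = 0, k = 0; i < data_len; ++i) {
        while (k > 0 && data[i] != pat[k])
            k = fail[k - 1];
        if (data[i] == pat[k])
            ++k;
        if (k == pat_len) {
            result = i + 1 - pat_len;
            break;
        }
    }
    free(fail);
    return result;
}
```

For a short pattern and mostly random binary data, the practical gain over the simple loop may be small; the guarantee matters more when the data contains many near matches.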

3 answers

Answer 1
4 2011-05-26 08:31:31

Answer 2
4 accepted 2011-05-26 09:35:37

Answer 3
0 2011-05-26 08:43:59
