Searching for text in binary data

Question

I have binary data that contains text. The text is known. What would be a fast way to search for that text?

As an example:

This is text 1---
!@##$%%#^%&!%^$! <= Assume this line is 3 MB of binary data
Now, This is text 2 ---
!@##$%%#^%&!%^$! <= Assume this line is 2.5 MB of binary data
This is text 3 ---

How do I search for the text "This is text 2"?

Currently I am doing this:

size_t count = 0;
size_t s_len = strlen("This is text 2");

//Assume data_len is the length of the data to be searched and data is a pointer (char*) to the start of it.
//Stop while a full s_len bytes still fit, so memcmp never reads past the end of the data.
for(; count + s_len <= data_len; ++count)
{
    if(!memcmp("This is text 2", data + count, s_len))
    {
         printf("%s\n", "Hurray found you...");
    }
}

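Is there any other, more efficient way to do this?
Would replacing the ++count logic with memchr('T') logic help? <= Please ignore this if the statement is not clear
What should the average-case big-O complexity of memchr be?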

Answer 1

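There are algorithms for this that are more sophisticated than a repeated memcmp (which is implemented in the obvious way, and has the obvious complexity when the data contains near matches).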

Well-known algorithms are Boyer-Moore and Knuth-Morris-Pratt. These are just two examples. The general category they fall into is "string matching".
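To give an idea of what such an algorithm looks like on raw bytes, here is a minimal sketch of the Horspool simplification of Boyer-Moore (not the full Boyer-Moore algorithm). The function name horspool_search and its signature are made up for this illustration. Because it works on lengths rather than NUL terminators, embedded zero bytes in the data are not a problem.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
 * Returns a pointer to the first occurrence of needle in haystack,
 * or NULL if there is none. */
const unsigned char *horspool_search(const unsigned char *haystack, size_t hay_len,
                                     const unsigned char *needle, size_t needle_len)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (needle_len == 0)
        return haystack;
    if (needle_len > hay_len)
        return NULL;

    /* A byte that does not occur in the needle allows a full-length skip. */
    for (i = 0; i <= UCHAR_MAX; ++i)
        shift[i] = needle_len;
    /* For bytes in the needle (last position excluded), shift so that their
     * rightmost occurrence lines up with the last byte of the window. */
    for (i = 0; i + 1 < needle_len; ++i)
        shift[needle[i]] = needle_len - 1 - i;

    pos = 0;
    while (pos <= hay_len - needle_len) {
        if (memcmp(haystack + pos, needle, needle_len) == 0)
            return haystack + pos;
        /* Skip ahead based on the byte under the end of the window. */
        pos += shift[haystack[pos + needle_len - 1]];
    }
    return NULL;
}
```

With the data from the question, the call would be something like horspool_search((const unsigned char *)data, data_len, (const unsigned char *)"This is text 2", 14). On mostly binary data the skips tend to be close to the needle length, so far fewer positions are examined than with the ++count loop.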

Answer 2

There is nothing in standard C to help you, but there is a GNU extension, memmem(), that does this:

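#define _GNU_SOURCE  /* before any #include, so that glibc declares memmem() */
#include <string.h>

#define TEXT2 "This is text 2"

/* sizeof(TEXT2) - 1: do not search for the terminating NUL */
char *pos = memmem(data, data_len, TEXT2, sizeof(TEXT2) - 1);

if (pos != NULL) {
    /* Found it. */
}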

If you need to be portable to systems that lack this function, you can take glibc's memmem() implementation and incorporate it into your program.
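If porting glibc's code is more than you need, a small fallback built only from standard memchr and memcmp may be enough. The sketch below is not glibc's algorithm. The name find_bytes is hypothetical, and the signature simply mirrors memmem(). It also touches on the memchr question: memchr is linear in the number of bytes it scans, so this does not improve the worst case over the plain loop, but it lets the library's usually well-optimized byte scan skip every position that does not start with the first byte of the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for memmem(); not glibc's implementation. */
void *find_bytes(const void *haystack, size_t hay_len,
                 const void *needle, size_t needle_len)
{
    const unsigned char *cur = haystack;
    const unsigned char *pat = needle;
    size_t remaining = hay_len;

    if (needle_len == 0)
        return (void *)haystack;

    while (remaining >= needle_len) {
        /* Only scan positions where a full needle still fits. */
        const unsigned char *hit = memchr(cur, pat[0], remaining - needle_len + 1);
        if (hit == NULL)
            return NULL;
        if (memcmp(hit, pat, needle_len) == 0)
            return (void *)hit;
        /* False candidate: resume one byte past it. */
        remaining -= (size_t)(hit - cur) + 1;
        cur = hit + 1;
    }
    return NULL;
}
```

A call such as find_bytes(data, data_len, "This is text 2", 14) returns a pointer to the match or NULL, the same convention as memmem().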

Answer 3

I know that the question is about the C programming language, but have you tried the Unix strings tool (http://en.wikipedia.org/wiki/Strings_(Unix)) together with grep?

$ strings datafile | grep "your text"

Edit:

If you want to use C, I suggest this simple optimization:

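size_t count = 0;
size_t s_len = strlen("This is text 2");

for(; count + s_len <= data_len; ++count)
{
    /* Skip bytes that cannot start the text; isprint() (from <ctype.h>) needs an unsigned char value. */
    if (!isprint((unsigned char)data[count])) continue;

    if(!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}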

If you want better performance, I suggest you look up and use a string matching algorithm.
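One such algorithm is Knuth-Morris-Pratt, sketched below for raw bytes. This is an illustration under assumptions, not a tuned implementation: the name kmp_search, the offset-based return value and the use of (size_t)-1 for "not found" are all choices made for this example. It never moves backwards in the data, which gives a worst case linear in data_len plus the length of the text.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: Knuth-Morris-Pratt search over raw bytes.
 * Returns the offset of the first match, or (size_t)-1 if there is none
 * (or if the failure table cannot be allocated). */
size_t kmp_search(const unsigned char *data, size_t data_len,
                  const unsigned char *pat, size_t pat_len)
{
    size_t *fail;
    size_t i, k, result = (size_t)-1;

    if (pat_len == 0)
        return 0;
    if (pat_len > data_len)
        return result;
    fail = malloc(pat_len * sizeof *fail);
    if (fail == NULL)
        return result;

    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of it. */
    fail[0] = 0;
    for (i = 1, k = 0; i < pat_len; ++i) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            ++k;
        fail[i] = k;
    }

    /* Scan the data once; on a mismatch fall back within the pattern
     * instead of stepping backwards in the data. */
    for (i = 0, k = 0; i < data_len; ++i) {
        while (k > 0 && data[i] != pat[k])
            k = fail[k - 1];
        if (data[i] == pat[k])
            ++k;
        if (k == pat_len) {
            result = i + 1 - pat_len;
            break;
        }
    }

    free(fail);
    return result;
}
```

For short texts in mostly binary data, a skip-based search or memmem() is often faster in practice; the appeal of this approach is the guaranteed linear bound.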

Answer 1
4 2011-05-26 08:31:31

Answer 2
4 Accepted 2011-05-26 09:35:37

Answer 3
0 2011-05-26 08:43:59
