Searching text in binary data
I have binary data that contains some text. The text is known. What would be a fast method to search for that text?
For example:
This is text 1---
!@##$%%#^%&!%^$! <= Assume this line is 3 MB of binary data
Now, This is text 2 ---
!@##$%%#^%&!%^$! <= Assume this line is 2.5 MB of binary data
This is text 3 ---
How can I search for the text "This is text 2"?
Currently I'm doing it like this:
size_t count = 0;
size_t s_len = strlen("This is text 2");
//Assume data_len is the length of the data to search and data is a pointer (char*) to the start of it.
//Stop while a full s_len bytes still remain, so memcmp never reads past the end of the data.
for(; count + s_len <= data_len; ++count)
{
    if(!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}
Would replacing the ++count logic with memchr('T') logic help? <= Please ignore this if the statement is not clear

There are algorithms for doing exactly this with better complexity than repeated memcmp (which is implemented the obvious way and has the obvious complexity for near matches).
Famous algorithms are Boyer-Moore and Knuth-Morris-Pratt. These are only two examples. The general category in which these fall is "string matching".
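As a rough illustration of this family of algorithms, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore (not the full algorithm). The function name horspool_search and its parameters are made up for this example. It works on raw bytes and explicit lengths, so embedded NUL bytes in the data are not a problem.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): returns a pointer to the
   first occurrence of needle in haystack, or NULL if there is none. */
const void *horspool_search(const void *haystack, size_t hay_len,
                            const void *needle, size_t needle_len)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (needle_len == 0)
        return haystack;
    if (needle_len > hay_len)
        return NULL;

    /* By default a mismatch lets us skip the whole needle length. */
    for (i = 0; i <= UCHAR_MAX; ++i)
        shift[i] = needle_len;
    /* Bytes that occur in the needle (except its last byte) allow only a
       smaller shift, one that lines up their last occurrence. */
    for (i = 0; i + 1 < needle_len; ++i)
        shift[n[i]] = needle_len - 1 - i;

    pos = 0;
    while (pos <= hay_len - needle_len) {
        if (memcmp(h + pos, n, needle_len) == 0)
            return h + pos;
        /* Shift by the table entry of the byte under the needle's end. */
        pos += shift[h[pos + needle_len - 1]];
    }
    return NULL;
}
```

With the names from the question, a call would look like horspool_search(data, data_len, "This is text 2", 14). On mostly non-text binary data, most steps should skip close to the full needle length, which is where the gain over advancing one byte at a time comes from.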
There's nothing in standard C to help you, but there is a GNU extension memmem() that does this:
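#define _GNU_SOURCE /* must come before any #include so that glibc declares memmem() */
#include <string.h>
#define TEXT2 "This is text 2"
/* sizeof(TEXT2) - 1: do not count the terminating NUL, which is not part of the text in the data. */
char *pos = memmem(data, data_len, TEXT2, sizeof(TEXT2) - 1);
if (pos != NULL) {
    /* Found it. */
}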
If you need to be portable to systems that don't have this, you could take the glibc implementation of memmem() and incorporate it into your program.
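If copying the glibc code is not an option, a much simpler fallback can be sketched along the lines of the memchr idea from the question. This is not the glibc implementation and it keeps the naive worst-case behaviour; the name my_memmem is invented here to avoid clashing with a system memmem().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical portable fallback with the same argument order as memmem().
   Returns a pointer to the first match, or NULL if there is none. */
void *my_memmem(const void *haystack, size_t hay_len,
                const void *needle, size_t needle_len)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;
    const unsigned char *end;

    if (needle_len == 0)
        return (void *)haystack;
    if (needle_len > hay_len)
        return NULL;

    /* One past the last position where a full match could still start. */
    end = h + (hay_len - needle_len) + 1;
    while (h < end) {
        /* Let memchr jump straight to the next candidate first byte. */
        h = memchr(h, n[0], (size_t)(end - h));
        if (h == NULL)
            return NULL;
        if (memcmp(h, n, needle_len) == 0)
            return (void *)h;
        ++h;
    }
    return NULL;
}
```

Whether this beats the plain ++count loop depends on how often the first byte ('T' in the example) occurs in the data. memchr is usually heavily optimized, so rare first bytes tend to make it worthwhile.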
I know that the question is about the C programming language, but have you tried using the strings Unix tool (http://en.wikipedia.org/wiki/Strings_(Unix)) with grep?
$ strings datafile | grep "your text"
EDIT:
If you want to use C, I suggest this simple optimization:
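/* Needs <ctype.h> for isprint(). */
size_t count = 0;
size_t s_len = strlen("This is text 2");
/* Stop while a full s_len bytes still remain, so memcmp stays inside the data. */
for(; count + s_len <= data_len; ++count)
{
    /* Skip bytes that cannot start the text; the cast avoids undefined behaviour for negative char values. */
    if (!isprint((unsigned char)data[count])) continue;
    if(!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}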
If you want better performance, I suggest you look for and use a string matching algorithm.
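As one example of such an algorithm, here is a minimal sketch of Knuth-Morris-Pratt working on raw bytes. The name kmp_search and the convention of returning (size_t)-1 for "not found" are choices made for this example only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the offset of the first match, or
   (size_t)-1 if the needle is absent or memory allocation fails. */
size_t kmp_search(const void *haystack, size_t hay_len,
                  const void *needle_ptr, size_t needle_len)
{
    const unsigned char *hay = haystack;
    const unsigned char *needle = needle_ptr;
    size_t *fail;
    size_t i, k, result = (size_t)-1;

    if (needle_len == 0)
        return 0;
    if (needle_len > hay_len)
        return result;

    fail = malloc(needle_len * sizeof *fail);
    if (fail == NULL)
        return result;

    /* fail[i] = length of the longest proper prefix of needle[0..i]
       that is also a suffix of it. */
    fail[0] = 0;
    for (i = 1, k = 0; i < needle_len; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }

    /* Scan the data once; on a mismatch fall back within the needle
       instead of moving backwards in the data. */
    for (i = 0, k = 0; i < hay_len; ++i) {
        while (k > 0 && hay[i] != needle[k])
            k = fail[k - 1];
        if (hay[i] == needle[k])
            ++k;
        if (k == needle_len) {
            result = i + 1 - needle_len;
            break;
        }
    }

    free(fail);
    return result;
}
```

The data is read strictly left to right and never re-read, so the running time is linear in data_len plus the needle length, regardless of how many near matches the binary data contains.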