
How can I partial compare two strings in C?

Let's say I have the following content:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

How do I search for "dummy" or "dummy text" in that string using C? Is there an easy way to do it, or only with heavy string manipulation? All I need is to search for it and return a Boolean with the result.

EDIT:
You guys created a big discussion around this topic and suggested a few algorithms, and I don't mind that because this might be useful for someone else, or even me in the future. But what I really wanted was the easiest way to do it, no matter the time/space complexity. That doesn't really matter for what I'm doing. So, strstr easily and quickly fixed my problem. I really have to get myself a cheat sheet of standard C functions.

The standard library function for this is strstr:

char *strstr(const char *haystack, const char *needle);

It returns a pointer into the string where the match was found, or NULL if it wasn't - so if all you need is a boolean, just test the return value: if (strstr(...)).
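As a minimal sketch of that test, the return value can be wrapped in a small boolean helper. The name contains() is made up for illustration and is not part of the standard library; only strstr() is.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a standard function): returns true when
 * needle occurs anywhere in haystack.  The match is case-sensitive and
 * byte-for-byte, exactly as strstr() defines it. */
bool contains(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}
```

Calling contains(text, "dummy text") on the sample sentence from the question would yield true, while contains(text, "Dummy") would yield false because of the capital letter.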

You can use the strstr function if you want something simple and your strings aren't too long. If your strings are very long, however, consider the KMP algorithm, as it is a lot more efficient.

I don't really like the Wikipedia article, as the implementation there looks a bit weird to me (although it's probably correct), and it's also misleading about KMP's performance. I prefer the implementation and description given here and on other sites returned by a Google search for "KMP algorithm".

There's an extensive discussion of a large number of string searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/, with illustrative C code and references.

There's a discussion in one set of comments about the costs of the algorithms. One of the points to bear in mind is that if you can amortize the cost of setup over many invocations of the search function, then the high-performance algorithms can give you enormous benefit. If you are going to be searching for different strings all the time, it is harder to win out.

I've got a version of the KMP (Knuth-Morris-Pratt) algorithm packaged for multiple reuse of the same search string. The header is:

/*
@(#)File:           $RCSfile: kmp.h,v $
@(#)Version:        $Revision: 1.4 $
@(#)Last changed:   $Date: 2008/02/02 05:49:34 $
@(#)Purpose:        Knuth-Morris-Pratt Search Algorithm
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 2005,2008
@(#)Product:        :PRODUCT:
*/

#ifndef KMP_H
#define KMP_H

#include <stddef.h> /* size_t */

typedef struct kmp_control kmp_control;

/*
** To set up a search (to repeatedly look for the same search string in
** multiple scan strings), use kmp_setsearch().  To start a search on a
** new scan string, use kmp_settarget().  To find the next match of a
** given search string in a given target string, use kmp_search().  Note
** that kmp_setsearch() and kmp_settarget() do not copy the data in the
** source and target strings; the pointers must remain valid.  You can
** copy kmp_control structures for reuse if desired.
*/
typedef void *(*kmp_malloc)(size_t nbytes);
typedef void (*kmp_free)(void *data);

extern kmp_control *kmp_setsearch(const char *search, size_t schlen);
extern void kmp_settarget(kmp_control *ctrl, const char *target, size_t tgtlen);
extern const char *kmp_search(kmp_control *ctrl);
extern void kmp_release(kmp_control *ctrl);
extern void kmp_setalloc(kmp_malloc mem_alloc, kmp_free mem_free);

#endif /* KMP_H */

Being able to specify memory allocation functions is a tad unusual - but my code often works in an environment where memory allocation is not done via the standard malloc() and so on, and you must be able to switch the memory allocator on demand. You can ignore the two typedefs and the corresponding function; the default settings are, of course, to use malloc() and free().

The basic KMP algorithm code came from the site above - but was modified to allow me to set the search string once and then search multiple target strings, etc. Contact me (see my profile) for the source code. I have got a similar structure for Boyer-Moore code too (same original source), and also a case-insensitive Boyer-Moore code.
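The source behind kmp.h is not reproduced here, so the following is only a rough sketch of the "set the search string once, then search many targets" idea using a textbook KMP failure table. All names (kmp_matcher, kmp_init, kmp_find, kmp_done) are hypothetical and deliberately differ from the kmp.h interface above; the sketch also assumes NUL-terminated strings and plain malloc()/free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types and names; this is NOT the kmp.h API shown above. */
typedef struct {
    const char *pat;  /* not copied: must stay valid while the matcher is used */
    size_t len;       /* length of pat */
    size_t *fail;     /* fail[i] = length of the longest proper prefix of
                         pat[0..i] that is also a suffix of pat[0..i] */
} kmp_matcher;

/* One-time setup: build the failure table.  Returns 0 on success, -1 on
 * allocation failure. */
int kmp_init(kmp_matcher *m, const char *pat)
{
    assert(m != NULL && pat != NULL);
    m->pat = pat;
    m->len = strlen(pat);
    m->fail = malloc((m->len ? m->len : 1) * sizeof *m->fail);
    if (m->fail == NULL)
        return -1;
    m->fail[0] = 0;
    for (size_t i = 1, k = 0; i < m->len; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = m->fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        m->fail[i] = k;
    }
    return 0;
}

/* Search one target; returns a pointer to the first match or NULL.
 * Can be called repeatedly with different targets. */
const char *kmp_find(const kmp_matcher *m, const char *text)
{
    if (m->len == 0)
        return text;
    for (size_t i = 0, k = 0; text[i] != '\0'; i++) {
        while (k > 0 && text[i] != m->pat[k])
            k = m->fail[k - 1];
        if (text[i] == m->pat[k])
            k++;
        if (k == m->len)
            return text + i + 1 - m->len;
    }
    return NULL;
}

/* Release the failure table. */
void kmp_done(kmp_matcher *m)
{
    free(m->fail);
    m->fail = NULL;
}
```

The table is built once in kmp_init() and reused by every kmp_find() call, which is where the amortization discussed above would come from.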

There's a good war story about strstr() and performance in Kernighan and Pike's excellent book "The Practice of Programming".


I did some experimentation - using a copy of the King James Bible (4.8 MB) as the plain text, and memory mapping that. For many searches, the (MacOS X 10.6.2 / BSD) strstr() was faster than either KMP or BM. When the strings grew long enough (12+ characters, approximately), then the BM algorithm finally outpaced strstr(). The KMP algorithm always seemed to be much slower.

Morals?

  • It is hard to out-pace a good library.
  • KMP is much slower than BM on plausible English language strings.

And the infrastructure I put in place around the algorithms may be too heavy - but the alternative in the original code is a callback mechanism, which presents some problems for determining the context of matches.

I would use strstr (also here).

I am not sure about the use of the word "partial" in the question. The argument ("dummy" or "dummy text") has to be fully matched, right?

I've always liked Boyer-Moore, myself. It is O(n), but must be set up (i.e., two tables must be precomputed). Thus it is good if a lot of text is to be searched, or the search strings are known in advance, thus making up for the cost of building the tables. It is also best for 8-bit ASCII.

[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm]
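To give a feel for the precomputed-table idea, here is a sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore that keeps only a single shift table rather than the two mentioned above. It is not full Boyer-Moore, and the names (horspool, horspool_init, horspool_find) are invented for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names.  Boyer-Moore-Horspool: a simplified variant of
 * Boyer-Moore with one shift table indexed by byte value. */
typedef struct {
    const char *pat;              /* not copied: must stay valid */
    size_t len;                   /* pattern length in bytes */
    size_t shift[UCHAR_MAX + 1];  /* how far to slide the window, keyed by
                                     the text byte under its last position */
} horspool;

/* One-time setup: precompute the shift table for the pattern. */
void horspool_init(horspool *h, const char *pat, size_t len)
{
    assert(h != NULL && pat != NULL);
    h->pat = pat;
    h->len = len;
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        h->shift[c] = len;        /* byte absent from pattern: skip it all */
    for (size_t i = 0; i + 1 < len; i++)
        h->shift[(unsigned char)pat[i]] = len - 1 - i;
}

/* Search text[0..n-1]; returns a pointer to the first match or NULL. */
const char *horspool_find(const horspool *h, const char *text, size_t n)
{
    if (h->len == 0)
        return text;
    if (h->len > n)
        return NULL;
    for (size_t pos = 0; pos + h->len <= n;
         pos += h->shift[(unsigned char)text[pos + h->len - 1]]) {
        if (memcmp(text + pos, h->pat, h->len) == 0)
            return text + pos;
    }
    return NULL;
}
```

The table has one entry per possible byte value, which is one reason this family of algorithms fits 8-bit character sets so comfortably; the setup cost pays off only when the same pattern is searched for repeatedly or the text is long.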

(BTW, is there a Unicode flavor of strstr()?)
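On that last question: standard C does provide wcsstr() in <wchar.h>, the wide-character counterpart of strstr(). Whether wchar_t actually holds Unicode code points depends on the platform, so treat the following as a sketch under that assumption; the helper name wcontains() is made up. For UTF-8 text held in ordinary char strings, plain strstr() already finds exact byte-sequence matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: wide-character version of the boolean test.
 * wcsstr() compares wchar_t values one by one; it does no Unicode
 * normalization or case folding. */
bool wcontains(const wchar_t *haystack, const wchar_t *needle)
{
    assert(haystack != NULL && needle != NULL);
    return wcsstr(haystack, needle) != NULL;
}
```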
