Regex vs. string Contains

Question

Hola. I'm failing to write a method that tests for words within a plain-text or HTML document. I was reasonably literate with regex, and I'm newer to C# (coming from a lot more Java).

Just 'cause,

string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space

and then,

string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext,@"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);

For a case where "c++" should be found, foundAsRegex is sometimes true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!

edit:

I'm searching for matches on skills in resumes, for example the distinct value "c++".

edit:

A real excerpt is given below:

"...administration- c, c++, perl, shell programming..."

Answer 1

The problem is that \b matches only between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem: "+" is a non-word character. So when you search for the pattern in "xxx c++, xxx", you're not going to find anything. There's no word boundary after the second "+", because the "," that follows it is also a non-word character. The pattern matches only when a word character comes directly after "c++", which is why the result looks inconsistent.
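To make the boundary behaviour concrete, here is a minimal sketch that reuses the pattern from the question. The class and method names (WordBoundaryDemo, FoundWithBoundaries) and the sample strings are made up for illustration; only Regex.IsMatch, Regex.Escape and string.Contains come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: builds the same pattern as the question,
    // \b + escaped tag + \b.
    public static bool FoundWithBoundaries(string text, string tag)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(tag) + @"\b");
    }

    public static void Main()
    {
        // "++" is followed by ",": two non-word characters, so no \b there.
        Console.WriteLine(FoundWithBoundaries("c, c++, perl", "c++")); // False

        // "++" is followed by "b": non-word then word, so \b matches.
        Console.WriteLine(FoundWithBoundaries("c++builder", "c++"));   // True

        // A plain substring test does not care about boundaries.
        Console.WriteLine("c, c++, perl".Contains("c++"));             // True
    }
}
```

So the regex version only succeeds when the text happens to put a letter, digit or underscore right after the tag.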

If the term you're looking for contains non-word characters, then you'll have to change your logic. I'm not sure what the best approach would be. I suppose you can use \W, but then it won't match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
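A rough sketch of the (^|\W) and (\W|$) idea, in case it helps. The names SkillMatcher and ContainsSkill are hypothetical, and this has not been tuned for speed; it only shows the pattern shape applied to the excerpt from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkillMatcher
{
    // Hypothetical helper: the tag must be preceded by the start of the
    // text or a non-word character, and followed by a non-word character
    // or the end of the text.
    public static bool ContainsSkill(string text, string tag)
    {
        string pattern = @"(^|\W)" + Regex.Escape(tag) + @"(\W|$)";
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        string excerpt = "...administration- c, c++, perl, shell programming...";

        Console.WriteLine(ContainsSkill(excerpt, "c++"));  // True
        Console.WriteLine(ContainsSkill(excerpt, "perl")); // True
        Console.WriteLine(ContainsSkill("abc++", "c++"));  // False: "b" precedes "c"
        Console.WriteLine(ContainsSkill("c++", "c++"));    // True: start and end of text
    }
}
```

One caveat with this shape: because "+" is itself a non-word character, searching for the tag "c" will also succeed on text that only contains "c++". If that matters, the set of allowed delimiters has to be narrowed.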

Answer 2

Your regular expression is turning into:

/\bc\+\+\b/

Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.

If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.

Edit: My original regex was /\bc++\b/, which is incorrect, because c++ is passed to Regex.Escape(), which escapes regular expression metacharacters like +. I've fixed it above.
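For reference, a small sketch of what Regex.Escape does to the tag before it is concatenated into the pattern. The class name EscapeDemo and the helper BuildPattern are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: returns the pattern string the question ends up
    // passing to Regex.IsMatch.
    public static string BuildPattern(string tag)
    {
        return @"\b" + Regex.Escape(tag) + @"\b";
    }

    public static void Main()
    {
        Console.WriteLine(Regex.Escape("c++"));  // c\+\+
        Console.WriteLine(Regex.Escape("perl")); // perl (nothing to escape)
        Console.WriteLine(BuildPattern("c++"));  // \bc\+\+\b
    }
}
```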

Solution 1
4 (accepted) 2011-02-18 20:17:59

Solution 2
1 2011-02-18 19:23:14
