Regex vs String.Contains

Hola. I'm failing to write a method to test for words within a plain text or HTML document. I'm reasonably literate with regex, and I'm newer to C# (coming from a lot more Java).

Just 'cause,

string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space

and then,

string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext,@"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);

For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!

edit:

I'm searching for matches on skills in resumes, for example the distinct value "c++".

edit:

A real excerpt is given below:

"...administration- c, c++, perl, shell programming..."

The problem is that \b matches only between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem: "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the last "+" character, because it is followed by ",", which is also a non-word character. The trailing \b only succeeds when the last "+" is immediately followed by a word character (as in "c++x"), which is why the result is sometimes true and sometimes false.
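To make the boundary behaviour concrete, here is a minimal sketch that runs the question's pattern against the resume excerpt and against a contrived string where a letter follows "c++". The class and method names (WordBoundaryDemo, FoundAsRegex) and the second sample string are made up for illustration; only Regex.IsMatch, Regex.Escape and string.Contains come from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: builds the same pattern as the question,
    // \b + escaped tag + \b.
    public static bool FoundAsRegex(string text, string tag)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(tag) + @"\b");
    }

    public static void Main()
    {
        string excerpt = "administration- c, c++, perl, shell programming";

        // Regex.Escape turns each "+" into "\+".
        Console.WriteLine(Regex.Escape("c++"));             // c\+\+

        // "+" followed by ",": two non-word characters, so no \b there.
        Console.WriteLine(FoundAsRegex(excerpt, "c++"));    // False

        // "+" followed by "a": non-word then word character, so \b matches.
        Console.WriteLine(FoundAsRegex("c++and perl", "c++")); // True

        // Contains does a plain substring search and ignores boundaries.
        Console.WriteLine(excerpt.Contains("c++"));         // True
    }
}
```

The only difference between the two regex calls is the character right after the second "+", which is what makes the original check look intermittent.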

If you're looking for tags that contain non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
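As a rough sketch of what that could look like in C#, the first method below uses the (^|\W) and (\W|$) wrapping described above. The second method is an alternative that is not part of this answer: it swaps in the lookarounds (?<!\w) and (?!\w), which assert "no word character on this side" without consuming anything. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkillMatcher
{
    // The wrapping suggested above: a non-word character, or the
    // start/end of the input, on each side of the escaped tag.
    public static bool FoundWithAlternation(string text, string tag)
    {
        return Regex.IsMatch(text, @"(^|\W)" + Regex.Escape(tag) + @"(\W|$)");
    }

    // Assumed alternative (not from the answer): lookarounds that only
    // check that no word character touches the tag on either side.
    public static bool FoundWithLookarounds(string text, string tag)
    {
        return Regex.IsMatch(text, @"(?<!\w)" + Regex.Escape(tag) + @"(?!\w)");
    }

    public static void Main()
    {
        string plaintext = "administration- c, c++, perl, shell programming";

        // "she" is included to show that a prefix of "shell" is rejected.
        foreach (string tag in new[] { "c++", "c", "perl", "she" })
        {
            Console.WriteLine(tag + ": "
                + FoundWithAlternation(plaintext, tag) + " / "
                + FoundWithLookarounds(plaintext, tag));
        }
    }
}
```

One caveat with either version: because "+" is a non-word character, a search for the tag "c" will also succeed on text that only contains "c++". If that distinction matters for resume matching, the set of characters allowed next to a tag probably needs to be chosen per tag rather than relying on \w and \W.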

Your regular expression is turning into:

/\bc\+\+\b/

Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.

If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.

Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes regular expression metacharacters like +. I've fixed it above.
