Regex vs String.Contains
Hola. I'm failing to write a method to test for words within a plain text or HTML document. I'm reasonably literate with regex, but I'm newer to C# (coming from much more Java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext,@"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes, for example the distinct value "c++".
edit:
A real excerpt is given below:
"...administration- c, c++, perl, shell programming..."
The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character. The trailing \b only matches when the second "+" is immediately followed by a word character (as in "c++11"), which would explain why the regex is true only some of the time.
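To make the boundary behaviour concrete, here is a minimal sketch that rebuilds the pattern from the question and runs it against a few inputs. The class name BoundaryDemo and the sample strings other than the resume excerpt are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Same construction as in the question: \b + escaped tag + \b.
    public static bool FoundAsRegex(string text, string tag) =>
        Regex.IsMatch(text, @"\b" + Regex.Escape(tag) + @"\b");

    public static void Main()
    {
        // "+" followed by "," : both are non-word characters, so no boundary.
        Console.WriteLine(FoundAsRegex("administration- c, c++, perl", "c++")); // False

        // "+" followed by a digit: non-word to word, so the trailing \b matches.
        Console.WriteLine(FoundAsRegex("c++11 experience", "c++")); // True

        // A tag made only of word characters behaves as expected.
        Console.WriteLine(FoundAsRegex("c, c++, perl", "perl")); // True
    }
}
```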
If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
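A rough sketch of that idea, in case it helps. FoundWithAlternation is the (^|\W) ... (\W|$) version spelled out literally. FoundWithLookarounds is an alternative not mentioned above: it uses the lookarounds (?<!\w) and (?!\w), which assert "no word character on this side" without consuming anything. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagMatcher
{
    // Literal version of the suggestion: a non-word character
    // (or the start/end of the input) on each side of the tag.
    public static bool FoundWithAlternation(string text, string tag) =>
        Regex.IsMatch(text, @"(^|\W)" + Regex.Escape(tag) + @"(\W|$)");

    // Alternative (an assumption, not from the answer): lookarounds that
    // only require that no word character touches the tag on either side.
    public static bool FoundWithLookarounds(string text, string tag) =>
        Regex.IsMatch(text, @"(?<!\w)" + Regex.Escape(tag) + @"(?!\w)");

    public static void Main()
    {
        string excerpt = "administration- c, c++, perl, shell programming";
        Console.WriteLine(FoundWithAlternation(excerpt, "c++")); // True
        Console.WriteLine(FoundWithLookarounds(excerpt, "c++")); // True
        Console.WriteLine(FoundWithLookarounds("abc++", "c++")); // False
    }
}
```

One caveat with either version: since "+" is a non-word character, searching for the tag "c" will also succeed on text that only contains "c++".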
Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
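A small sketch of that difference, using the two checks from the question side by side. The class name and the input "abc++x" are made up. The trailing "x" is there so that the second \b is satisfied and only the leading one causes the failure.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContainsVsRegex
{
    // Plain substring search: ignores what surrounds the tag.
    public static bool FoundAsContains(string text, string tag) =>
        text.Contains(tag);

    // The question's pattern: word boundary on both sides of the escaped tag.
    public static bool FoundAsRegex(string text, string tag) =>
        Regex.IsMatch(text, @"\b" + Regex.Escape(tag) + @"\b");

    public static void Main()
    {
        // "b" and "c" are both word characters, so there is no \b before "c".
        Console.WriteLine(FoundAsContains("abc++x", "c++")); // True
        Console.WriteLine(FoundAsRegex("abc++x", "c++"));    // False
    }
}
```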
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes regular expression metacharacters like +. I've fixed it above.