How to remove noise words from a string and search it using RegEx? C#

I am trying to perform a search for a string within a string.

StringToSearch: The quick brown fox jumped over the fence
searchTerm: brown jumped

So when I call StringToSearch.ContainsEx(searchTerm), it returns true. The way I have it working now is: I first remove noise words using string.Remove(), then call string.Split(' ') to get the words, and then perform a Contains search in the text to be searched for every word in that array.

It works, but I want to make it as performant as I can, so can I use RegEx to do the same kind of search? That is, remove noise words like "the", "of", etc., and then see whether all words in the searchString are contained within the text to be searched?

I have no idea how to use RegEx in C# at all, so a code sample would be helpful. Thank you, and please suggest any other techniques if you feel they would serve me better than regular expressions.

Try this (if you need more words, add them in the same fashion):

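// Verbatim string (@"...") so that \b reaches the regex engine as a word boundary, not a backspace.
string sPattern = @"(?=.*\bbrown\b)(?=.*\bjumped\b)";
if (System.Text.RegularExpressions.Regex.IsMatch(mainString, sPattern))
{
    // do something
}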

(?=.*\bbrown\b) uses a positive lookahead to check whether the word brown exists in the text. \b is a word boundary, so the pattern does not match the word when it is part of another word; for example, it avoids matching and inside the word land.
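The lookahead pattern above can also be built from the search term at run time, which covers the noise-word step as well. The following is only a minimal sketch: the class name LookaheadSearch, the methods BuildPattern and ContainsAllWords, and the contents of the noise-word list are all hypothetical and not from the original post. It assumes words are separated by single spaces and consist of word characters, since \b behaves differently next to punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadSearch
{
    // Hypothetical noise-word list (an assumption); extend it as needed.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "of", "a", "an", "and" };

    // Builds one (?=.*\bword\b) lookahead per non-noise search word.
    // The leading ^ anchors the match so the engine does not retry at every position.
    public static string BuildPattern(string searchTerm)
    {
        IEnumerable<string> lookaheads = searchTerm
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !NoiseWords.Contains(word))
            .Select(word => @"(?=.*\b" + Regex.Escape(word) + @"\b)");
        return "^" + string.Concat(lookaheads);
    }

    // True when every non-noise word of searchTerm occurs in text as a whole word.
    // Note: if searchTerm contains only noise words, the pattern is just "^" and matches any text.
    public static bool ContainsAllWords(string text, string searchTerm)
    {
        return Regex.IsMatch(text, BuildPattern(searchTerm),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With the example from the question, LookaheadSearch.ContainsAllWords("The quick brown fox jumped over the fence", "brown jumped") returns true. Regex.Escape is used so that characters with special meaning in a search word are treated literally.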

Try using LINQ; I think it will work well if both of your strings are long. With regex, you first have to construct the pattern dynamically (one part for each element of searchTerm), and you would end up with a long regex that might be slow.

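// Requires: using System.Collections.Generic; using System.Linq;
List<string> StringToSearchList = new List<string>(StringToSearch.Split(' '));
List<string> searchTermList = new List<string>(searchTerm.Split(' '));

// Distinct words of StringToSearch that do not appear in searchTerm (case-sensitive).
var query = StringToSearchList.Except(searchTermList);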

You can use string.Join to convert the array back to a string.
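Note that Except as used above yields the words of StringToSearch that are missing from searchTerm, so it suits removing words rather than testing containment. One possible way to cover both steps from the question with LINQ is sketched below. The class WordSetSearch, its method names, and the noise-word list are hypothetical, and the sketch assumes words are separated by spaces with no attached punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSetSearch
{
    // Hypothetical noise-word list (an assumption); extend it as needed.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "of", "a", "an", "and" };

    // Splits on spaces and drops noise words.
    private static IEnumerable<string> Words(string s)
    {
        return s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Where(word => !NoiseWords.Contains(word));
    }

    // Rebuilds the string without noise words, using string.Join as suggested above.
    public static string RemoveNoiseWords(string s)
    {
        return string.Join(" ", Words(s));
    }

    // True when every non-noise word of searchTerm is a word of text (case-insensitive).
    public static bool ContainsAllWords(string text, string searchTerm)
    {
        var textWords = new HashSet<string>(Words(text), StringComparer.OrdinalIgnoreCase);
        return Words(searchTerm).All(word => textWords.Contains(word));
    }
}
```

The HashSet gives constant-time lookups on average, so the check is roughly linear in the total number of words, which may matter when both strings are long.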
