
How to efficiently check if a given string contains words from an array

I'm building an app where a user inputs a block of text. Once submitted, I need to check whether that block of text contains words from my predefined list of words.
The list of words is large, around 50K entries, so I need a way to do the check quickly and efficiently.
Here are some solutions I've thought of, but they seem really inefficient:

Option 1: Create a function in the app's code that loops through each predefined word and checks whether that word is in the block of text.

e.g.

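var wordList = ['fox', 'dog', 'tree'];  // in my app this list will be large

// Returns true as soon as any listed word occurs in the input as a substring.
function contains(userInput) {
    for (var i = 0; i < wordList.length; i++) {
        if (userInput.indexOf(wordList[i]) > -1)
            return true;
    }
    return false;
}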

Option 2: Both the block of text and the word list will be stored in the DB, so I could use an SQL statement like this.

e.g.

-- Match rows where the input text contains a listed word.
SELECT *
FROM UserInput ui
    INNER JOIN WordList wl ON ui.InputText LIKE CONCAT('%', wl.word, '%')

Is there a better way to do this?

If you're looking at anything bigger than a small data set (and 50k qualifies), then I'd definitely do any data manipulation in the database.

You're correct that an open-ended LIKE isn't going to be terribly performant, but it'll be orders of magnitude faster than doing it outside of a database. If your user input is guaranteed to be a full word, then you could break everything in WordList into separate words and do an exact match search. If you're not guaranteed to have a full word from UserInput, then I'd use your option 2.

If performance is super-important, then you could also look into full-text indexes.

Aho–Corasick string matching algorithm

The article links to a C# implementation.
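To give a rough idea of what the algorithm does, here is a minimal JavaScript sketch. It is not the linked C# implementation; the names buildAutomaton and findWords are made up for illustration. It builds a trie of the word list, adds failure links with a breadth-first pass, and then scans the text once. Note that it matches substrings, just like indexOf in Option 1.

```javascript
// Build an Aho-Corasick automaton (illustrative sketch, hypothetical names).
// Each node holds: next (char -> node index), fail (node index), out (words ending here).
function buildAutomaton(words) {
  const nodes = [{ next: new Map(), fail: 0, out: [] }];

  // Phase 1: insert every word into a trie rooted at node 0.
  for (const word of words) {
    if (word.length === 0) continue;
    let state = 0;
    for (const ch of word) {
      if (!nodes[state].next.has(ch)) {
        nodes.push({ next: new Map(), fail: 0, out: [] });
        nodes[state].next.set(ch, nodes.length - 1);
      }
      state = nodes[state].next.get(ch);
    }
    nodes[state].out.push(word);
  }

  // Phase 2: breadth-first pass to compute failure links.
  // Depth-1 nodes keep fail = 0 (the root).
  const queue = [...nodes[0].next.values()];
  for (let head = 0; head < queue.length; head++) {
    const state = queue[head];
    for (const [ch, child] of nodes[state].next) {
      // Follow the parent's failure chain until a node with a ch-edge is found.
      let f = nodes[state].fail;
      while (f !== 0 && !nodes[f].next.has(ch)) f = nodes[f].fail;
      const target = nodes[f].next.get(ch);
      nodes[child].fail = target === undefined ? 0 : target;
      // Inherit the words that end at the failure target.
      nodes[child].out.push(...nodes[nodes[child].fail].out);
      queue.push(child);
    }
  }
  return nodes;
}

// Scan the text once and collect every listed word that occurs in it.
function findWords(nodes, text) {
  const found = new Set();
  let state = 0;
  for (const ch of text) {
    while (state !== 0 && !nodes[state].next.has(ch)) state = nodes[state].fail;
    const nextState = nodes[state].next.get(ch);
    state = nextState === undefined ? 0 : nextState;
    for (const word of nodes[state].out) found.add(word);
  }
  return found;
}
```

The automaton is built once for the word list and can be reused for every submitted text. Each scan walks the text a single time, however many words are in the list.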

You can do this either in the database or using LINQ.

Split your user input on spaces so you have an array or table-valued parameter containing the user input words. Then just do an inner join against your word list. Anything left after the join will be words that exist in both places. Performance will be excellent.

SELECT SomeColumn
FROM WordList wl
JOIN @tvp ui ON wl.SomeColumn = ui.SomeColumn

It will be orders of magnitude faster than doing LIKE searches, and a lot simpler to set up than full-text indexing.
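The same split-then-join idea can be sketched on the application side in JavaScript, with a Set playing the role of the indexed word table. This is only an illustration: matchingWords and wordSet are hypothetical names, and lowercasing the input is an assumption the answer does not state.

```javascript
// Hypothetical word list; in the question it holds roughly 50K entries.
const wordSet = new Set(['fox', 'dog', 'tree']);

// Return the distinct input words that also appear in the word list,
// in the order they first occur in the input.
function matchingWords(userInput, words) {
  // Assumption: matching is case-insensitive and the list is stored lowercase.
  const tokens = userInput.toLowerCase().split(/\s+/).filter(Boolean);
  return [...new Set(tokens)].filter((token) => words.has(token));
}

console.log(matchingWords('The dog chased the fox', wordSet)); // [ 'dog', 'fox' ]
```

Because the input is split only on whitespace, a token such as "fox." with trailing punctuation would not match; splitting on non-word characters instead would handle that.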

I would definitely do it on the application side. I'm assuming the list is a "bad words" list that won't change often. If that assumption is correct, the code would be something like this:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static List<String> Cached;

List<String> GetBadWords()
{
    if (Cached == null)
    {
        Cached = new List<String>();
        // load words from db into the static list
        Cached.Sort(); // important step: BinarySearch requires a sorted list
    }
    return Cached;
}

public bool IstextValid(String sText)
{
    List<String> oBadWords = GetBadWords();
    foreach (String sWord in Regex.Split(sText, @"\W")) // split on any non-word character
        if (oBadWords.BinarySearch(sWord) >= 0) // the list is sorted, so each lookup is O(log n)
            return false;
    return true;
}

So basically there are two optimizations to consider:

  • keep the list in memory, so there is no need to query SQL on every check
  • use binary search to avoid O(N * M)
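For readers working in JavaScript, as in the question, the two optimizations might look roughly like the sketch below. The names cachedWords, getSortedWords, binarySearch and isTextValid are illustrative, and loadWords is a hypothetical callback standing in for the database query.

```javascript
// Sorted word list kept in memory after the first load.
let cachedWords = null;

// loadWords is a hypothetical callback that returns the word list (e.g. from the DB).
function getSortedWords(loadWords) {
  if (cachedWords === null) {
    // Default sort() orders strings by UTF-16 code units, which matches
    // the < comparison used in binarySearch below.
    cachedWords = [...loadWords()].sort();
  }
  return cachedWords;
}

// Classic binary search: returns the index of target, or -1 if it is absent.
function binarySearch(sorted, target) {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] === target) return mid;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

// The text is valid when none of its words is in the list.
function isTextValid(text, loadWords) {
  const words = getSortedWords(loadWords);
  return text.split(/\W+/).every((token) => binarySearch(words, token) < 0);
}
```

With N list entries and M words in the text, this costs about M * log N comparisons per check instead of N * M.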
