
Fast way to use String.Contains with huge list C#

I have something like this:

List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("user1other");

List<string> key_blacklist = new List<string>();
key_blacklist.Add("hacker");
key_blacklist.Add("other");

foreach (string user in listUser)
{
   foreach (string key in key_blacklist)
   {
      if (user.Contains(key))
      {
           // remove it in listUser
      }
   }
}

The expected result in listUser is: user1, user2.

The problem is that I have a huge listUser (more than 10 million entries) and a huge key_blacklist (100,000 entries). With lists that size, this code is very slow. Is there any way to make it faster?

UPDATE: I found a new solution here: http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string I hope it helps anyone who ends up here! :)

If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.

        // Add .AsOrdered() after AsParallel() if the original order matters.
        var allowedUsers = listUser.AsParallel().Where(
            s =>
            {
                foreach (var key in key_blacklist)
                {
                    if (s.Contains(key))
                    {
                        return false; //Not to be included
                    }
                }

                return true; //To be included, as no match with the blacklist
            }).ToList();

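Also, do you have to use .Contains? An exact-match test is going to be much, much quicker. .Equals can reject almost every non-match after comparing the lengths or the first few characters, and if the names are kept in a HashSet<string>, a lookup only compares against entries whose hash codes match, which is an integer comparison. Super quick.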

If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Can you test each string, at the time it's added, for whether it represents a user on the blacklist?
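
One way to read that last suggestion is to reject a name when it is added, instead of filtering the whole list afterwards. Below is a minimal sketch under that assumption. FilteredUserList and TryAdd are hypothetical names, not something from the question, and the per-name check still costs one Contains call per blacklisted key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: the blacklist check runs once per user, at insertion time.
public class FilteredUserList
{
    private readonly List<string> _blacklist;
    private readonly List<string> _users = new List<string>();

    public FilteredUserList(IEnumerable<string> blacklist)
    {
        _blacklist = blacklist.ToList();
    }

    // The users that passed the check so far.
    public IReadOnlyList<string> Users
    {
        get { return _users; }
    }

    // Returns true if the user was accepted, false if it contains a blacklisted key.
    public bool TryAdd(string user)
    {
        if (_blacklist.Any(key => user.Contains(key)))
        {
            return false;
        }

        _users.Add(user);
        return true;
    }
}
```

The cost is spread over the inserts, so no separate 10-million-item cleanup pass is needed, but the total number of Contains calls stays the same.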

UPDATE: In response to @Rawling's comment below: if you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running a .Equals test of each username against a candidate would be slower than running .Contains on the candidate. This is because an exact comparison is really quick, and with a hash-based set it becomes a single lookup.
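
The exact-match idea can be sketched with a HashSet<string>. This assumes the full set of banned usernames is known in advance; BannedUserSet and IsBanned are illustrative names. Note that it only catches whole-name matches, so "hacker" in the set does not ban "userhacker".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: exact-match lookup of banned usernames.
public class BannedUserSet
{
    private readonly HashSet<string> _banned;

    public BannedUserSet(IEnumerable<string> bannedUserNames)
    {
        // Ordinal comparison: case-sensitive, culture-independent.
        _banned = new HashSet<string>(bannedUserNames, StringComparer.Ordinal);
    }

    // Average O(1): hash the name once, then compare only within the matching bucket.
    public bool IsBanned(string userName)
    {
        return _banned.Contains(userName);
    }
}
```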

If you are using Entity Framework or LINQ to SQL, then using LINQ and sending the query to the server can improve performance. Instead of removing items, you query for the items that fulfil the requirement, i.e. the users whose name doesn't contain any banned expression:

var allowedUsers = listUser.Where(u => !key_blacklist.Any(k => u.Contains(k))).ToList();
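
If the list lives in memory rather than in a database, List<T>.RemoveAll does the removal in place in a single pass, which also avoids modifying the list inside a foreach. A small sketch, where BlacklistFilter and RemoveBlacklisted are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlacklistFilter
{
    // Removes, in place, every user that contains at least one blacklisted key.
    // Returns the number of users removed.
    public static int RemoveBlacklisted(List<string> users, List<string> keys)
    {
        return users.RemoveAll(user => keys.Any(key => user.Contains(key)));
    }
}
```

This still performs up to users × keys Contains calls, so it fixes the removal itself, not the underlying cost.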

A possible solution is to use a tree-like data structure.

The basic idea is to have the blacklisted words organised like this:

+ h
| + ha
|   + hac
|     - hacker
|     - [other words beginning with hac]
| 
+ f
| + fu
|   + fuk
|     - fukoff
|     - [other words beginning with fuk]

Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".

In the example I provided, with your sample data, this of course makes no difference, but with real data sets it should significantly reduce the number of Contains calls, since you don't check against the full list of blacklisted words every time.

Here is a code example (please note that the code is pretty bad; this is just to illustrate my idea):

using System;
using System.Collections.Generic;
using System.Linq;

class Program {

    class Blacklist {
        public string Start;
        public int Level;
        const int MaxLevel = 3;
        public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
        public List<string> BlacklistedWords = new List<string>();

        public Blacklist() {
            Start = string.Empty;
            Level = 0;
        }

        Blacklist(string start, int level) {
            Start = start;
            Level = level;
        }

        public void AddBlacklistedWord(string word) {
            if (word.Length > Level && Level < MaxLevel) {
                string index = word.Substring(0, Level + 1);
                Blacklist sublist = null;
                if (!SubBlacklists.TryGetValue(index, out sublist)) {
                    sublist = new Blacklist(index, Level + 1);
                    SubBlacklists[index] = sublist;
                }
                sublist.AddBlacklistedWord(word);
            } else {
                BlacklistedWords.Add(word);
            }
        }

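        public bool ContainsBlacklistedWord(string wordToCheck) {
            if (wordToCheck.Length > Level && Level < MaxLevel) {
                foreach (var sublist in SubBlacklists.Values) {
                    // Only descend into a branch whose prefix occurs in the word,
                    // and keep trying the other branches if that one finds nothing.
                    if (wordToCheck.Contains(sublist.Start)
                        && sublist.ContainsBlacklistedWord(wordToCheck)) {
                        return true;
                    }
                }
            }
            // Words stored directly at this level.
            return BlacklistedWords.Any(x => wordToCheck.Contains(x));
        }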

    }

    static void Main(string[] args) {

        List<string> listUser = new List<string>();
        listUser.Add("user1");
        listUser.Add("user2");
        listUser.Add("userhacker");
        listUser.Add("userfukoff1");

        Blacklist blacklist = new Blacklist();
        blacklist.AddBlacklistedWord("hacker");
        blacklist.AddBlacklistedWord("fukoff");

        foreach (string user in listUser) {
            if (blacklist.ContainsBlacklistedWord(user)) {
                Console.WriteLine("Contains blacklisted word: {0}", user);
            }
        }
    }
}
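
The prefix-tree idea can be taken a step further with the Aho-Corasick algorithm. That is a different technique from the code above: it builds a trie of all blacklisted words plus "failure links", and then scans each user name once, however many words are on the blacklist. What follows is only a sketch under that assumption; AhoCorasickMatcher and ContainsAny are illustrative names, matching is ordinal and case-sensitive, and the code has not been measured against the data sizes in the question.

```csharp
using System;
using System.Collections.Generic;

// Sketch of an Aho-Corasick matcher: "does the text contain any of the patterns?"
public class AhoCorasickMatcher
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // node for the longest proper suffix that is also a trie prefix
        public bool IsMatch;  // a pattern ends here, or somewhere along the failure chain
    }

    private readonly Node _root = new Node();

    public AhoCorasickMatcher(IEnumerable<string> patterns)
    {
        // 1. Build the trie of all patterns.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue; // empty patterns are ignored here
            Node node = _root;
            foreach (char c in pattern)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // 2. Breadth-first pass to compute the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in current.Next)
            {
                Node fail = current.Fail;
                while (fail != _root && !fail.Next.ContainsKey(edge.Key))
                {
                    fail = fail.Fail;
                }
                Node target;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : _root;
                // Inherit matches from the suffix so shorter patterns are not missed.
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Single pass over the text; returns true as soon as any pattern is found.
    public bool ContainsAny(string text)
    {
        Node node = _root;
        foreach (char c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
            {
                node = node.Fail;
            }
            Node next;
            if (node.Next.TryGetValue(c, out next))
            {
                node = next;
            }
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

With the matcher built once from key_blacklist, the filtering step could then be written as listUser.RemoveAll(matcher.ContainsAny).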

You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences for exact-match lookups:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

class Program
{
    private static readonly Random random = new Random((int)DateTime.Now.Ticks);

    static void Main(string[] args)
    {
        Console.WriteLine("Creating Lists...");
        var stringList = new List<string>();
        var hashList = new HashSet<string>();
        var sortedList = new SortedSet<string>();

        var searchWords1 = new string[3];
        int ndx = 0;

        for (int x = 0; x < 1000000; x++)
        {
            string str = RandomString(10);

            if (x == 5 || x == 500000 || x == 999999)
            {
                str = "Z" + str;
                searchWords1[ndx] = str;
                ndx++;
            }
            stringList.Add(str);
            hashList.Add(str);
            sortedList.Add(str);
        }

        Console.WriteLine("Lists created!");
        var sw = new Stopwatch();
        sw.Start();
        bool search1 = stringList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        search1 = hashList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        search1 = sortedList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
    }

    private static string RandomString(int size)
    {
        var builder = new StringBuilder();
        char ch;
        for (int i = 0; i < size; i++)
        {
            ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
            builder.Append(ch);
        }

        return builder.ToString();
    }
}

On my machine, I got the following results:

Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms

As you can see, List<T> was extremely slow compared to HashSet<T> and SortedSet<T>. Those were almost instantaneous.

