
Fastest "starts with" search algorithm

I need to implement a search algorithm which only searches from the start of the string rather than anywhere within the string.

I am new to algorithms, but the ones I have looked at seem to scan through the whole string and report an occurrence anywhere in it.

I have a collection of strings (over 1 million) which needs to be searched every time the user presses a key.

EDIT:

This will be an incremental search. I currently have it implemented with the following code, and my searches take between 300 and 700 ms against over 1 million possible strings. The collection isn't ordered, but there is no reason it couldn't be.

private ICollection<string> SearchCities(string searchString)
{
    return _cityDataSource.AsParallel().Where(x => x.ToLower().StartsWith(searchString)).ToArray();
}

I've adapted the code from a Visual Studio Magazine article that implements a trie.

The following program demonstrates how to use a Trie to do fast prefix searching.

In order to run this program, you will need a text file called "words.txt" containing a large list of words. You can download one from GitHub.

After you compile the program, copy the "words.txt" file into the same folder as the executable.

When you run the program, type a prefix (such as "prefix" ;)) and press Return, and it will list all the words beginning with that prefix.

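This should be a very fast lookup: walking to the prefix node takes a number of steps proportional to the length of the prefix, regardless of how many words are stored. See the Visual Studio Magazine article for more details!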

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ConsoleApp1
{
    class Program
    {
        static void Main()
        {
            var trie = new Trie();
            trie.InsertRange(File.ReadLines("words.txt"));

            Console.WriteLine("Type a prefix and press return.");

            while (true)
            {
                string prefix = Console.ReadLine();

                if (string.IsNullOrEmpty(prefix))
                    continue;

                var node = trie.Prefix(prefix);

                if (node.Depth == prefix.Length)
                {
                    foreach (var suffix in suffixes(node))
                        Console.WriteLine(prefix + suffix);
                }
                else
                {
                    Console.WriteLine("Prefix not found.");
                }

                Console.WriteLine();
            }
        }

        static IEnumerable<string> suffixes(Node parent)
        {
            var sb = new StringBuilder();
            return suffixes(parent, sb).Select(suffix => suffix.TrimEnd('$'));
        }

        static IEnumerable<string> suffixes(Node parent, StringBuilder current)
        {
            if (parent.IsLeaf())
            {
                yield return current.ToString();
            }
            else
            {
                foreach (var child in parent.Children)
                {
                    current.Append(child.Value);

                    foreach (var value in suffixes(child, current))
                        yield return value;

                    --current.Length;
                }
            }
        }
    }

    public class Node
    {
        public char Value { get; set; }
        public List<Node> Children { get; set; }
        public Node Parent { get; set; }
        public int Depth { get; set; }

        public Node(char value, int depth, Node parent)
        {
            Value = value;
            Children = new List<Node>();
            Depth = depth;
            Parent = parent;
        }

        public bool IsLeaf()
        {
            return Children.Count == 0;
        }

        public Node FindChildNode(char c)
        {
            return Children.FirstOrDefault(child => child.Value == c);
        }

        public void DeleteChildNode(char c)
        {
            for (var i = 0; i < Children.Count; i++)
            {
                if (Children[i].Value == c)
                {
                    Children.RemoveAt(i);
                    return; // stop here: continuing would skip the element that shifted into slot i
                }
            }
        }
    }

    public class Trie
    {
        readonly Node _root;

        public Trie()
        {
            _root = new Node('^', 0, null);
        }

        public Node Prefix(string s)
        {
            var currentNode = _root;
            var result = currentNode;

            foreach (var c in s)
            {
                currentNode = currentNode.FindChildNode(c);

                if (currentNode == null)
                    break;

                result = currentNode;
            }

            return result;
        }

        public bool Search(string s)
        {
            var prefix = Prefix(s);
            return prefix.Depth == s.Length && prefix.FindChildNode('$') != null;
        }

        public void InsertRange(IEnumerable<string> items)
        {
            foreach (string item in items)
                Insert(item);
        }

        public void Insert(string s)
        {
            var commonPrefix = Prefix(s);
            var current = commonPrefix;

            for (var i = current.Depth; i < s.Length; i++)
            {
                var newNode = new Node(s[i], current.Depth + 1, current);
                current.Children.Add(newNode);
                current = newNode;
            }

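            // '$' marks the end of a word; skip it if the word is already stored,
            // otherwise inserting a word twice would list it twice.
            if (current.FindChildNode('$') == null)
                current.Children.Add(new Node('$', current.Depth + 1, current));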
        }

        public void Delete(string s)
        {
            if (!Search(s))
                return;

            var node = Prefix(s).FindChildNode('$');

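            // Prune empty nodes upwards, but stop at the root: its Parent is null,
            // so going further would throw a NullReferenceException.
            while (node.Parent != null && node.IsLeaf())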
            {
                var parent = node.Parent;
                parent.DeleteChildNode(node.Value);
                node = parent;
            }
        }
    }
}
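
One thing to watch: the question lower-cases every city before comparing, whereas the trie above matches case exactly. A possible way to reconcile the two is sketched below as a separate, compact trie that folds case on the way in and on lookup but still returns the original spelling. The class name CaseInsensitiveTrie and its members are invented for this illustration, and using a Dictionary per node is an assumption about what suits the data, not something taken from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: a trie keyed on lower-cased characters.
public sealed class CaseInsensitiveTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<string> Words;   // original spellings ending here; null until needed
    }

    private readonly Node _root = new Node();

    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        if (node.Words == null)
            node.Words = new List<string>();
        node.Words.Add(word);
    }

    // Returns every stored word whose lower-cased form starts with the lower-cased prefix.
    // The order of the results is unspecified.
    public IEnumerable<string> StartingWith(string prefix)
    {
        var node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out var next))
                return Array.Empty<string>();
            node = next;
        }
        return Collect(node);
    }

    // Iterative depth-first walk of the subtree below the prefix node.
    private static IEnumerable<string> Collect(Node start)
    {
        var pending = new Stack<Node>();
        pending.Push(start);
        while (pending.Count > 0)
        {
            var current = pending.Pop();
            if (current.Words != null)
                foreach (var word in current.Words)
                    yield return word;
            foreach (var child in current.Children.Values)
                pending.Push(child);
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var trie = new CaseInsensitiveTrie();
        foreach (var city in new[] { "London", "Lisbon", "lima", "Paris" })
            trie.Insert(city);

        var hits = trie.StartingWith("LI").OrderBy(s => s, StringComparer.OrdinalIgnoreCase);
        Console.WriteLine(string.Join(", ", hits));   // prints "lima, Lisbon"
    }
}
```

With a million entries, one dictionary per node costs a fair amount of memory, so it would be worth measuring before committing to this layout.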

A couple of thoughts:

First, your million strings need to be ordered, so that you can "seek" to the first matching string and return strings, in order, until you no longer have a match (seek via C# List<string>.BinarySearch, perhaps). That's how you touch the fewest strings possible.

Second, you should probably not try to hit the string list until there's a pause in input of at least 500 ms (give or take).

Third, your queries should be async and cancelable, because a query that is still running will often be superseded by the next keystroke.
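
The second and third points could be combined roughly as follows. This is only a sketch: DebouncedSearch and QueryAsync are names made up for the example, the search delegate is assumed to check the token itself, and the demo uses a 100 ms pause purely so that it finishes quickly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: waits for a pause in typing, then runs a cancelable search.
public sealed class DebouncedSearch<TResult>
{
    private readonly Func<string, CancellationToken, TResult> _search;
    private readonly TimeSpan _quietPeriod;
    private CancellationTokenSource _cts = new CancellationTokenSource();

    public DebouncedSearch(Func<string, CancellationToken, TResult> search, TimeSpan quietPeriod)
    {
        _search = search;
        _quietPeriod = quietPeriod;
    }

    // Call on every keystroke. Completed is false when a later keystroke superseded this query.
    public async Task<(bool Completed, TResult Result)> QueryAsync(string text)
    {
        // Cancel whatever the previous keystroke started (its delay or its search).
        var cts = new CancellationTokenSource();
        var previous = Interlocked.Exchange(ref _cts, cts);
        previous.Cancel();

        try
        {
            await Task.Delay(_quietPeriod, cts.Token);   // wait for the pause in input
            var result = await Task.Run(() => _search(text, cts.Token), cts.Token);
            return (true, result);
        }
        catch (OperationCanceledException)
        {
            return (false, default(TResult));
        }
    }
}

public static class Demo
{
    public static async Task Main()
    {
        string[] cities = { "lima", "lisbon", "london" };

        var debounced = new DebouncedSearch<int>(
            (prefix, token) =>
            {
                int count = 0;
                foreach (var city in cities)
                {
                    token.ThrowIfCancellationRequested();   // give up promptly when superseded
                    if (city.StartsWith(prefix, StringComparison.Ordinal))
                        count++;
                }
                return count;
            },
            TimeSpan.FromMilliseconds(100));

        var first = debounced.QueryAsync("l");
        var second = debounced.QueryAsync("lo");   // typed before the pause elapsed

        Console.WriteLine((await first).Completed);   // prints "False"
        var latest = await second;
        Console.WriteLine(latest.Completed + " " + latest.Result);   // prints "True 1"
    }
}
```

In a UI you would call QueryAsync from the text-changed handler and update the list only when Completed is true.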

Finally, any subsequent query should first check whether the new search string extends the most recent search string; if it does, you can begin the new seek from where the last one landed, saving lots of time.
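
The first and last points might look something like the sketch below. It assumes the strings can be lower-cased and sorted once up front with an ordinal comparer; IncrementalPrefixSearch and its members are names invented for the example. It does its own binary search rather than calling List<string>.BinarySearch, because it needs the first match and the end of the matching run rather than the position of any single hit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: prefix search over a sorted array, narrowing as the user types.
public sealed class IncrementalPrefixSearch
{
    private readonly string[] _sorted;
    private string _lastPrefix = "";
    private int _lo, _hi;   // half-open range [_lo, _hi) that matched _lastPrefix

    public IncrementalPrefixSearch(IEnumerable<string> items)
    {
        // Assumption: case-insensitive matching is wanted, so fold case once here.
        _sorted = items.Select(s => s.ToLowerInvariant()).ToArray();
        Array.Sort(_sorted, StringComparer.Ordinal);
        _hi = _sorted.Length;
    }

    public IReadOnlyList<string> Search(string prefix)
    {
        prefix = prefix.ToLowerInvariant();

        // If the user only appended characters, every match lies inside the previous range.
        bool extendsLast = prefix.StartsWith(_lastPrefix, StringComparison.Ordinal);
        int lo = extendsLast ? _lo : 0;
        int hi = extendsLast ? _hi : _sorted.Length;

        int start = FirstAtOrAfter(prefix, lo, hi);
        int end = FirstNonMatch(prefix, start, hi);

        _lastPrefix = prefix;
        _lo = start;
        _hi = end;
        return new ArraySegment<string>(_sorted, start, end - start);   // a view, no copying
    }

    // Index of the first entry in [lo, hi) that is >= key in ordinal order.
    private int FirstAtOrAfter(string key, int lo, int hi)
    {
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(_sorted[mid], key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Index of the first entry in [lo, hi) that does not start with prefix.
    // Valid because, from lo onwards, the matching entries form one contiguous run.
    private int FirstNonMatch(string prefix, int lo, int hi)
    {
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (_sorted[mid].StartsWith(prefix, StringComparison.Ordinal)) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}

public static class Demo
{
    public static void Main()
    {
        var search = new IncrementalPrefixSearch(
            new[] { "London", "Lima", "Lisbon", "Los Angeles", "Paris" });

        Console.WriteLine(string.Join(", ", search.Search("l")));    // prints "lima, lisbon, london, los angeles"
        Console.WriteLine(string.Join(", ", search.Search("lo")));   // prints "london, los angeles"
    }
}
```

Each keystroke then costs two binary searches over a range that keeps shrinking, and the result is a window onto the sorted array rather than a copy.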

I suggest using LINQ.

string x = "searchterm";
List<string> y = new List<string>();
List<string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();

Where x is your keystroke search text term, y is your collection of strings to search, and Matches is the matches from your collection.

I tested this with the first 1 million prime numbers; here is the code, adapted from above:

Stopwatch SW = new Stopwatch();
SW.Start();
string x = "2";
List<string> y = System.IO.File.ReadAllText("primes1.txt").Split(' ').ToList();
y.RemoveAll(xo => xo == " " || xo == "" || xo == "\r\r\n");
List<string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();
SW.Stop();
Console.WriteLine("matches: " + Matches.Count);
Console.WriteLine("time taken: " + SW.Elapsed.TotalSeconds);
Console.Read();

Result is:

matches: 77025

time taken: 0.4240604

Of course, this tests against strings of digits, and I don't know whether LINQ converts the values first or whether using numbers makes any difference.
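
On that open question: the values in the test are already strings, so nothing is converted for each comparison. What does affect the number is that the stopwatch above also covers reading and splitting the file, and that StartsWith(string) uses a culture-sensitive comparison by default. A rough sketch that times only the query, with both comparison modes, is below. It uses generated digit strings as a stand-in for primes1.txt (an assumption, since that file isn't included here), and StartsWithTiming and CountMatches are names made up for the example.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class StartsWithTiming
{
    // Counts the items that begin with prefix under the given comparison mode.
    public static int CountMatches(string[] items, string prefix, StringComparison comparison)
    {
        return items.Count(s => s.StartsWith(prefix, comparison));
    }

    public static void Main()
    {
        // Stand-in data (assumption): the decimal strings "1" .. "1000000".
        string[] items = Enumerable.Range(1, 1_000_000).Select(i => i.ToString()).ToArray();

        foreach (var comparison in new[] { StringComparison.CurrentCulture, StringComparison.Ordinal })
        {
            // Only the query is inside the timed region; the data is already in memory.
            var sw = Stopwatch.StartNew();
            int matches = CountMatches(items, "2", comparison);
            sw.Stop();

            Console.WriteLine($"{comparison}: {matches} matches in {sw.Elapsed.TotalMilliseconds} ms");
        }
    }
}
```

The timings will vary by machine and runtime, so treat them as a comparison between the two modes rather than as absolute figures.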
