
Best way of sorting a list of strings based on difference from a target string?

I need to sort a List based on the difference between the strings in the list and a target string.

What's the best way of implementing this kind of sorting algorithm?

I don't care too much about performance but the collection could potentially become big (let's say half a million tops).

Any Help Appreciated!

I would recommend calculating the Levenshtein distance between each string and the target, and then simply ordering by the integer result. The distance function itself is left as "magic code" here:

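// Requires: using System.Collections.Generic; using System.Linq;
public void Example()
{
    string target = "target";

    List<string> myStrings = new List<string>();

    myStrings.Add("this");
    myStrings.Add("that");

    // OrderBy is a stable sort; the distance to the target is the sort key.
    myStrings = myStrings.OrderBy(each => Levenshtein(each, target)).ToList();
}

public int Levenshtein(string stringA, string stringB)
{
    // Magic goes here: return the edit distance between stringA and stringB.
    return 0;
}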
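For what it's worth, the "magic" could look roughly like the following. This is a minimal sketch of the textbook dynamic-programming Levenshtein distance, not code from the answer above; the class name EditDistance is made up for illustration, and it keeps only two rows of the matrix so memory stays proportional to the shorter dimension.

```csharp
using System;

// Hypothetical helper class; the name is an assumption, not from the original post.
public static class EditDistance
{
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        // Only two rows of the DP matrix are needed at any time.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }

            // Reuse the two arrays instead of allocating a new row each time.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

Each call costs time proportional to the product of the two string lengths, so for half a million strings it is worth making sure the distance is computed once per string rather than once per comparison.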

And without OrderBy, for the old-school .NET 2.0 crowd?

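List<string> myStrings;
myStrings.Sort(new LevenshteinCompare("target"));
...

public class LevenshteinCompare : IComparer<string>
{
    private readonly string target;

    public LevenshteinCompare(string target)
    {
        this.target = target;
    }

    public int Compare(string x, string y)
    {
        // Magic goes here: Levenshtein is the distance function from the previous example.
        return Levenshtein(x, target).CompareTo(Levenshtein(y, target));
    }
}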
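One caveat with a comparer like this: the sort calls Compare many times per element, and each call would recompute two distances. A possible workaround that still works on .NET 2.0 is to compute every distance once and let Array.Sort(keys, items) reorder the strings by those keys. The sketch below is only an illustration; DistanceSort, SortByDistance and the Distance delegate are hypothetical names, and the metric is passed in so any distance function can be plugged in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; all names here are assumptions made for this sketch.
public static class DistanceSort
{
    // Stand-in for whichever string metric is chosen (Levenshtein, Hamming, ...).
    public delegate int Distance(string a, string b);

    public static void SortByDistance(List<string> items, string target, Distance distance)
    {
        string[] values = items.ToArray();
        int[] keys = new int[values.Length];

        // Compute each distance exactly once.
        for (int i = 0; i < values.Length; i++)
            keys[i] = distance(values[i], target);

        // Sorts the keys and applies the same reordering to the values.
        Array.Sort(keys, values);

        items.Clear();
        items.AddRange(values);
    }
}
```

Note that Array.Sort is not a stable sort, so strings at the same distance from the target may end up in any relative order.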

"What's the best way of implementing this kind of sorting algorithm?"

Being tongue-in-cheek, I'd suggest using the library implementation of quicksort, with the distance to the target string as the sorting key.

That's of course not a helpful answer. Why not? Because what you really want to know is "What's a good difference metric for strings?"

The answer to the real question is, sadly, "it depends"; it depends on which properties of the distance you care about.

That being said, read up on the Levenshtein distance and what it really says about the strings.

You can modify the basic algorithm to skew the metric in favor of identical characters occurring in long runs by fiddling with the weighting of different steps in the dynamic programming matrix.
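To show where that weighting would plug in, here is a rough sketch of the same dynamic-programming matrix with a separate cost per step. It does not implement a run-length bonus specifically; it only exposes the knobs. The class name and the three cost parameters are assumptions chosen for illustration, and which values make sense depends entirely on your data.

```csharp
using System;

// Hypothetical helper; names and parameters are assumptions for this sketch.
public static class WeightedEditDistance
{
    public static int Compute(string a, string b,
                              int insertCost, int deleteCost, int substituteCost)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        // First column: delete everything from a. First row: insert everything from b.
        for (int i = 1; i <= a.Length; i++) d[i, 0] = i * deleteCost;
        for (int j = 1; j <= b.Length; j++) d[0, j] = j * insertCost;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                // Matching characters are free; mismatches pay the substitution cost.
                int diagonal = d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : substituteCost);
                int deletion = d[i - 1, j] + deleteCost;
                int insertion = d[i, j - 1] + insertCost;
                d[i, j] = Math.Min(diagonal, Math.Min(deletion, insertion));
            }
        }

        return d[a.Length, b.Length];
    }
}
```

With all three costs set to 1 this reduces to the plain Levenshtein distance; raising the substitution cost, for example, makes the metric prefer alignments that keep matching characters in place.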

You can also use the Soundex algorithm, which says something about which strings sound similar (but that works best for short strings; I don't know what kind of input you use).
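For reference, a Soundex code could be computed along these lines. This is a sketch of the common American Soundex rules as I understand them (first letter kept, consonants mapped to digits, H and W skipped without separating equal codes, result padded to four characters); the class name Phonetic is made up, and non-letters are simply ignored, which is an assumption on my part.

```csharp
using System;
using System.Text;

// Hypothetical helper; the name is an assumption for this sketch.
public static class Phonetic
{
    // Code for each letter A..Z: digits for consonants, '0' for vowels and Y,
    // '-' for H and W (which are skipped entirely after the first letter).
    private const string Codes = "0123012-02245501262301-202";

    public static string Soundex(string word)
    {
        StringBuilder result = new StringBuilder();
        char lastCode = '0';

        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // ignore anything that is not A..Z

            char code = Codes[raw - 'A'];
            if (result.Length == 0)
            {
                // The first letter is kept as-is, but its code still suppresses
                // an immediately following letter with the same code.
                result.Append(raw);
                lastCode = code;
            }
            else if (code != '-')
            {
                // Vowels reset lastCode, so equal codes separated by a vowel are both kept.
                if (code != '0' && code != lastCode) result.Append(code);
                lastCode = code;
                if (result.Length == 4) break;
            }
        }

        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

Soundex gives a code rather than a distance, so for sorting you would still need a rule on top of it, for example putting strings whose code equals the target's code first, or running an edit distance on the codes.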

If the strings are of equal length, you can also use the Hamming distance (count the number of indexes where the strings differ). That can probably be generalized by counting every index that exists in only one of the two strings as a difference, which gives you something Levenshtein-like (kinda, sorta, maybe).
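A minimal sketch of that generalized Hamming distance, assuming a plain position-by-position comparison; the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper; names are assumptions for this sketch.
public static class PositionalDistance
{
    public static int GeneralizedHamming(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);

        // Every index that exists in only one string counts as a difference.
        int distance = Math.Abs(a.Length - b.Length);

        // Within the shared range, count positions where the characters differ.
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i]) distance++;
        }

        return distance;
    }
}
```

This runs in linear time, but unlike Levenshtein it has no notion of shifting: "abcd" and "bcde" differ at every position here, while two edits are enough under Levenshtein.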

The short version: it depends. I've given some input, but I can't say which is going to be a good decision for you without some more information from you.

