Best way of sorting a list of strings based on difference from a target string?

I need to sort a List based on the difference between the strings in the list and a target string.

What's the best way of implementing this kind of sorting algorithm?

I don't care too much about performance, but the collection could potentially become big (let's say half a million tops).

Any help appreciated!

I would recommend calculating the Levenshtein distance and then simply ordering by the integer result.

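// Needs: using System.Collections.Generic; using System.Linq;
public void Example()
{
    string target = "target";

    List<string> myStrings = new List<string>();

    myStrings.Add("this");
    myStrings.Add("that");

    // Sort ascending by edit distance to the target (closest first).
    myStrings = myStrings.OrderBy(each => Levenshtein(each, target)).ToList();
}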

public int Levenshtein(string stringA, string stringB)
{
    // Magic goes here
    return 0;
}
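The "magic" left out of the stub above could be filled in along these lines. This is a minimal sketch of the textbook dynamic-programming Levenshtein distance, not the answerer's own code; the class name `EditDistance` and the helper `SortByDistance` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class EditDistance
{
    // Textbook Levenshtein distance, keeping only two rows of the DP matrix.
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // The row just filled becomes the "previous" row for the next character.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Returns a new list ordered from closest to farthest from the target.
    public static List<string> SortByDistance(IEnumerable<string> items, string target)
    {
        return items.OrderBy(s => Levenshtein(s, target)).ToList();
    }
}
```

Each distance costs time proportional to the product of the two string lengths, so for half a million short strings the sort itself is unlikely to be the bottleneck.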

And without OrderBy, for the old skool .NET 2.0 guys?

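List<string> myStrings;
myStrings.Sort(new LevenshteinCompare()); // Sort takes a comparer instance, not a type name
...

public class LevenshteinCompare : IComparer<string>
{
    public int Compare(string x, string y)
    {
        // Magic goes here: compare the distance of x and of y to the target string
        return 0;
    }
}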
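One way the comparer might be fleshed out is to hand it the target string and a distance function up front. The sketch below is an assumption about how that could look, not the answerer's code: `StringDistance` and `DistanceToTargetComparer` are hypothetical names, and the cache is there because `List<T>.Sort` calls `Compare` many times for the same element.

```csharp
using System.Collections.Generic;

// Hypothetical delegate type: any function that scores how far apart two strings are.
public delegate int StringDistance(string a, string b);

// Hypothetical comparer: orders strings by their distance to a fixed target.
// Assumes the list contains no null entries.
public class DistanceToTargetComparer : IComparer<string>
{
    private readonly string target;
    private readonly StringDistance distance;

    // Remember each computed distance so it is calculated once per distinct string.
    private readonly Dictionary<string, int> cache = new Dictionary<string, int>();

    public DistanceToTargetComparer(string target, StringDistance distance)
    {
        this.target = target;
        this.distance = distance;
    }

    public int Compare(string x, string y)
    {
        return DistanceTo(x).CompareTo(DistanceTo(y));
    }

    private int DistanceTo(string s)
    {
        int d;
        if (!cache.TryGetValue(s, out d))
        {
            d = distance(s, target);
            cache[s] = d;
        }
        return d;
    }
}
```

With the `Levenshtein` method from the first answer in scope, the call would then read `myStrings.Sort(new DistanceToTargetComparer("target", Levenshtein));`. Note that `List<T>.Sort` is not a stable sort, so strings at the same distance may end up in any relative order.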

What's the best way of implementing this kind of sorting algorithm?

Being tongue-in-cheek, I'd suggest using the library implementation of quicksort, with the distance to the target string as the sorting key.

That's of course not a helpful answer. Why not? Because what you really want to know is "What's a good difference metric for strings?"

The answer to the real question is, sadly, "it depends"; it depends on which properties of the distance you care about.

That being said, read up on the Levenshtein distance and what it really says about the strings.

You can modify the basic algorithm to skew the metric in favor of identical characters occurring in long runs by fiddling with the weighting of different steps in the dynamic programming matrix.
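To show where such weights plug in, here is a rough sketch of an edit distance with a separate cost per step. The class name and the three cost parameters are assumptions made for illustration. It does not reward long runs by itself; that would need extra state in the recurrence, but the weights are the knobs you would start from.

```csharp
using System;

// Hypothetical helper: Levenshtein-style distance with a configurable cost per step.
public static class WeightedEditDistance
{
    public static int Distance(string a, string b,
                               int insertCost, int deleteCost, int substituteCost)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        // First column: delete every character of a. First row: insert every character of b.
        for (int i = 1; i <= a.Length; i++) d[i, 0] = i * deleteCost;
        for (int j = 1; j <= b.Length; j++) d[0, j] = j * insertCost;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                // Matching characters cost nothing; mismatches pay the substitution weight.
                int substitution = d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : substituteCost);
                int deletion = d[i - 1, j] + deleteCost;
                int insertion = d[i, j - 1] + insertCost;
                d[i, j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
        }
        return d[a.Length, b.Length];
    }
}
```

With all three costs set to 1 this is the plain Levenshtein distance. Raising `substituteCost` above `insertCost + deleteCost` makes the algorithm prefer a delete plus an insert over a substitution.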

You can also use the Soundex algorithm, which says something about which strings sound similar (but that works best for short strings; I don't know what kind of input you use).
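For reference, a compact sketch of American Soundex as it is usually described (first letter plus three digits, with H and W not separating equal codes). The class name is made up, and the handling of non-letters and empty input is my own assumption rather than part of any standard.

```csharp
using System.Text;

// Hypothetical helper implementing the commonly described American Soundex rules.
public static class Soundex
{
    // Digit for each letter A..Z; '0' marks letters that are not coded.
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string word)
    {
        StringBuilder result = new StringBuilder();
        char lastCode = '0';
        foreach (char raw in word)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z') continue;       // assumption: skip anything that is not A-Z
            char code = Codes[c - 'A'];
            if (result.Length == 0)
            {
                result.Append(c);                    // keep the first letter itself
                lastCode = code;
                continue;
            }
            if (c == 'H' || c == 'W') continue;      // H and W do not separate equal codes
            if (code != '0' && code != lastCode) result.Append(code);
            lastCode = code;                         // a vowel resets lastCode to '0'
            if (result.Length == 4) break;
        }
        // Pad short codes with zeros so every code has four characters.
        return result.ToString().PadRight(4, '0');
    }
}
```

Soundex gives a code rather than a distance, so for sorting you would need to turn it into a key, for example by putting strings whose code equals the target's code first: `OrderBy(s => Soundex.Encode(s) == Soundex.Encode(target) ? 0 : 1)`.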

If the strings are of equal length, you can also use the Hamming distance (count the number of indexes where the strings differ). That can probably be generalized by counting indexes that exist in only one of the strings as always different, which gives you something Levenshtein-like (kinda sorta maybe).
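The generalization described above is short enough to sketch directly. The names `PositionalDistance` and `PaddedHamming` are made up for this example.

```csharp
using System;

// Hypothetical helper for the "Hamming distance with unequal lengths" idea.
public static class PositionalDistance
{
    // Counts positions where the strings differ. Positions that exist in only
    // one of the strings (the length difference) always count as different.
    public static int PaddedHamming(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        int distance = Math.Abs(a.Length - b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i]) distance++;
        }
        return distance;
    }
}
```

This runs in linear time, but it is only loosely Levenshtein-like: a single character inserted at the front shifts every later position, so `PaddedHamming("abc", "xabc")` is 4 where the Levenshtein distance is 1.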

The short version: it depends. I've given some input, but I can't say which is going to be a good decision for you without some more information from you.
