
Compare Two Ordered Lists in C#

I have two lists of strings. One list is an approximation of the other, and I need some way of measuring the accuracy of the approximation.

As a makeshift way of scoring the approximation, I sort each list (the approximation and the answer) by a numeric value that corresponds to each string, then bucket it into three partitions (high, medium, low). I then check, for every element in the approximation, whether the string is in the same partition of the correct list.

I sum the number of correctly classified strings and divide it by the total number of strings. I understand that this is a very crude way of measuring the accuracy of the estimate, and I was hoping that better alternatives were available. This is a very small component of a larger piece of work, and I was hoping not to have to reinvent the wheel.
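For reference, the scoring described above could be sketched roughly as follows. This is only an illustration under stated assumptions: both lists are already sorted from highest to lowest value, every string is unique within its list, and the names `BucketScore`, `AssignBuckets`, and `Score` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BucketScore
{
    // Maps each item to a bucket index (0 = high ... buckets - 1 = low).
    // Assumption: the list is already sorted from highest to lowest value.
    public static Dictionary<string, int> AssignBuckets(IReadOnlyList<string> sorted, int buckets)
    {
        var result = new Dictionary<string, int>();
        for (int i = 0; i < sorted.Count; i++)
        {
            // Integer arithmetic spreads the items as evenly as possible.
            result[sorted[i]] = i * buckets / sorted.Count;
        }
        return result;
    }

    // Fraction of estimated items that land in the same bucket as in the actual list.
    public static double Score(IReadOnlyList<string> actual, IReadOnlyList<string> estimate, int buckets = 3)
    {
        if (estimate.Count == 0) return 0.0;
        var actualBuckets = AssignBuckets(actual, buckets);
        var estimateBuckets = AssignBuckets(estimate, buckets);
        // Items missing from the actual list count as misclassified.
        int correct = estimate.Count(item =>
            actualBuckets.TryGetValue(item, out int b) && b == estimateBuckets[item]);
        return (double)correct / estimate.Count;
    }
}
```

A score of 1.0 means every string sits in the same bucket in both lists. Swaps inside a bucket are not penalized at all.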

EDIT: I think I wasn't clear enough. I don't need the two lists to be exactly equal; I need some sort of measure that shows the lists are similar. For example, the High-Medium-Low (HML) approach we have taken shows that the estimated list is sufficiently similar. The downside of this approach is that if an item is at the bottom of the "High" bracket in the estimated list but at the top of the "Medium" bracket in the actual list, the scoring algorithm counts it as a miss even though the two positions are close.

One possibility is that, in addition to the HML approach, the bottom 20% of each partition is compared to the top 20% of the next partition, or something along those lines.
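One way that boundary idea might look in code is sketched below. It is a guess at the intent, not a settled design: an item also counts as correct when its two buckets are adjacent, it sits in the bottom margin of the upper bucket in one list, and in the top margin of the lower bucket in the other. The names `TolerantBucketScore`, `Slot`, `Locate`, and `marginPercent` are hypothetical, and the lists are assumed to be sorted from highest to lowest with unique strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TolerantBucketScore
{
    // Bucket index plus the item's position inside that bucket, kept as an
    // integer remainder (Offset / list length = fraction of the way down the bucket).
    private readonly struct Slot
    {
        public Slot(int bucket, int offset) { Bucket = bucket; Offset = offset; }
        public int Bucket { get; }
        public int Offset { get; }
    }

    private static Dictionary<string, Slot> Locate(IReadOnlyList<string> sorted, int buckets)
    {
        var slots = new Dictionary<string, Slot>();
        for (int i = 0; i < sorted.Count; i++)
        {
            int scaled = i * buckets;
            slots[sorted[i]] = new Slot(scaled / sorted.Count, scaled % sorted.Count);
        }
        return slots;
    }

    public static double Score(IReadOnlyList<string> actual, IReadOnlyList<string> estimate,
                               int buckets = 3, int marginPercent = 20)
    {
        if (estimate.Count == 0) return 0.0;
        var a = Locate(actual, buckets);
        var e = Locate(estimate, buckets);
        int correct = 0;
        foreach (var item in estimate)
        {
            if (!a.TryGetValue(item, out Slot sa)) continue;   // not in the actual list
            Slot se = e[item];
            if (sa.Bucket == se.Bucket) { correct++; continue; }
            if (Math.Abs(sa.Bucket - se.Bucket) != 1) continue; // more than one bucket apart

            // "upper" is the slot in the higher-ranked bucket, "lower" the one below it.
            bool actualIsUpper = sa.Bucket < se.Bucket;
            Slot upper = actualIsUpper ? sa : se;
            Slot lower = actualIsUpper ? se : sa;
            int upperCount = actualIsUpper ? actual.Count : estimate.Count;
            int lowerCount = actualIsUpper ? estimate.Count : actual.Count;

            bool nearBottomOfUpper = upper.Offset * 100 >= (100 - marginPercent) * upperCount;
            bool nearTopOfLower = lower.Offset * 100 < marginPercent * lowerCount;
            if (nearBottomOfUpper && nearTopOfLower) correct++;
        }
        return (double)correct / estimate.Count;
    }
}
```

With short lists the margin may contain no items at all, so the tolerance only has an effect once each bucket holds several entries.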

Thanks all for your help!!

Nice question. I think you could use the following method to compare your lists:

public double DetermineAccuracyPercentage(int numberOfEqualElements, int yourListsLength)
{
    return ((double)numberOfEqualElements / (double)yourListsLength) * 100.0;
}

The number returned indicates how similar the two lists are. If numberOfEqualElements equals the lists' length (Count), the lists are completely equal. As a ratio, the accuracy of the approximation is numberOfEqualElements / yourListsLength: 1 means completely equal, 0 means completely different, and values between 0 and 1 indicate the degree of equality. My sample returns this as a percentage.

If you compare the following two lists, you get 75% equality, because 3 of the 4 elements are equal (3/4).

// Requires: using System.Collections.Generic; using System.Linq;
IList<string> list1 = new List<string>();
IList<string> list2 = new List<string>();

list1.Add("Dog");
list1.Add("Cat");
list1.Add("Fish");
list1.Add("Bird");

list2.Add("Dog");
list2.Add("Cat");
list2.Add("Fish");
list2.Add("Frog");

// Intersect treats the lists as sets, so element order and duplicates are ignored.
int resultOfComparing = list1.Intersect(list2).Count();
double accuracyPercentage = DetermineAccuracyPercentage(resultOfComparing, list1.Count);

I hope it helps.

So, we're taking a sequence of items and grouping it into three partitions: high, medium, and low. Let's first make an object to represent those three partitions:

public class Partitions<T>
{
    public IEnumerable<T> High { get; set; }
    public IEnumerable<T> Medium { get; set; }
    public IEnumerable<T> Low { get; set; }
}

Next, to score an estimate, we take two of these objects, one for the actual values and one for the estimate. For each priority level we want to see how many of the items are in both collections; this is an "intersection". We sum the counts of the intersections across all levels.

Then just divide that count by the total:

public static double EstimateAccuracy<T>(Partitions<T> actual
    , Partitions<T> estimate)
{
    int correctlyCategorized = 
        actual.High.Intersect(estimate.High).Count() +
        actual.Medium.Intersect(estimate.Medium).Count() +
        actual.Low.Intersect(estimate.Low).Count();

    double total = actual.High.Count()+
        actual.Medium.Count()+
        actual.Low.Count();

    return correctlyCategorized / total;
}

Of course, if we generalize this from 3 priorities to a sequence of sequences, in which each sequence corresponds to some bucket (i.e. there are N buckets, not just 3), the code actually gets easier:

public static double EstimateAccuracy<T>(
    IEnumerable<IEnumerable<T>> actual
    , IEnumerable<IEnumerable<T>> estimate)
{
    var query = actual.Zip(estimate, (a, b) => new
    {
        valid = a.Intersect(b).Count(),
        total = a.Count()
    }).ToList();
    return query.Sum(pair => pair.valid) /
        (double)query.Sum(pair => pair.total);
}

I would take both List<String>s and combine them element by element into an IEnumerable<Boolean>:

public IEnumerable<Boolean> Combine<Ta, Tb>(List<Ta> seqA, List<Tb> seqB)
{
    if (seqA.Count != seqB.Count)
        throw new ArgumentException("Lists must be the same size...");

    // Yield true wherever the elements at the same index are equal.
    for (int i = 0; i < seqA.Count; i++)
        yield return seqA[i].Equals(seqB[i]);
}

And then use Aggregate() to count how many strings match, keeping a running total:

// Cast to double so the division is not truncated to 0 or 1.
var result = (double)Combine(a, b).Aggregate(0, (acc, t) => t ? acc + 1 : acc) / a.Count;
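The same positional comparison could also be sketched with Enumerable.Zip instead of a hand-written iterator. This is an alternative formulation, not the code above: the names `PositionalMatch` and `Fraction` are hypothetical, and treating two empty lists as a perfect match is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PositionalMatch
{
    // Fraction of positions at which both lists hold equal elements.
    public static double Fraction<T>(IReadOnlyList<T> a, IReadOnlyList<T> b)
    {
        if (a.Count != b.Count)
            throw new ArgumentException("Lists must be the same size.");
        if (a.Count == 0) return 1.0; // assumption: two empty lists match perfectly

        // Pair elements by index, then count the pairs that are equal.
        // EqualityComparer<T>.Default also handles null elements safely.
        int matches = a.Zip(b, (x, y) => EqualityComparer<T>.Default.Equals(x, y))
                       .Count(equal => equal);
        return (double)matches / a.Count;
    }
}
```

Unlike the Intersect-based approaches, this penalizes any item that is out of position, even by one place, so it is the strictest of the measures discussed here.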
