Algorithm to assess differences in sequence between two lists

I am looking for an algorithm to compare two sequences.

Sequence A will be a list of integer IDs in optimal order.

Sequence B will be a list of the same IDs in an order that may differ.

I want to detect the differences in sequence between the two lists, and as such am looking for an algorithm to do this. I am wondering if this is a common problem that has been solved before.

As Julián Urbano suggested, Kendall Tau Correlation is a good measure to use. I decided to implement it in .NET using LINQ. Here is my code, which implements both Tau-a (for data without ties) and Tau-b (which permits ties). The code assumes that your data has not yet been sorted, so it sorts it once according to Measure1 to get the first set of rank values, then sorts it by Measure2 to get a second set of ranks. It is the ranks that are correlated, not the original data. (If the measure lambda function returns the original object unchanged, then you can apply it to ranks that you already have.)

using System;
using System.Collections.Generic;
using System.Linq;
using static System.Math;

namespace Statistics
{
    /// <summary>
    /// Compute the Kendall Tau Correlation of two orderings of values, a non-parametric correlation that compares the rankings
    /// of the values, not the values themselves.
    /// 
    /// The correlation of the measures is one if all values are in the same order when sorted by two different measures.
    /// The correlation is minus one if the second ordering is the reverse of the first.
    /// The correlation is zero if the values are completely uncorrelated.
    /// 
    /// Two algorithms are provided: TauA and TauB. TauB accounts properly for duplicate values (ties), unlike TauA.
    /// </summary>
    public class KendallTauCorrelation<T, C> where C : IComparable<C>
    {
        private Func<T, C> Measure1 { get; }
        private Func<T, C> Measure2 { get; }

        public KendallTauCorrelation(Func<T, C> measure1, Func<T, C> measure2)
        {
            Measure1 = measure1;
            Measure2 = measure2;
        }

        /// <summary>
        /// Compute the Tau-a rank correlation, which is suitable if there are no ties in rank.
        /// </summary>
        /// <returns>A value between -1 and 1. 
        /// If the items are ranked the same by both measures, returns 1.
        /// If the items are ranked in exactly opposite order, returns -1.
        /// The more items that are out of sequence, the lower the score.
        /// If the two rankings are completely uncorrelated, returns zero.
        /// </returns>
        /// <param name="data">Data to be ranked according to two measures and then correlated.</param>
        public double TauA(IList<T> data)
        {
            var ranked = data
                     .OrderBy(Measure1)
                     .Select((item, index) => new { Data = item, Rank1 = index + 1})
                     .OrderBy(pair => Measure2(pair.Data))
                     .Select((pair, index) => new { pair.Rank1, Rank2 = index + 1 })
                     .ToList();
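            var n = ranked.Count;
            if (n <= 1)
                return 0; // No data or one data point: avoid 0/0 (NaN), matching TauB.
            var numerator = 0L;
            var denominator = n * (n - 1.0) / 2.0; // Computed in double to avoid int overflow.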
            for (var i = 1; i < n; i++)
                for (var j = 0; j < i; j++)
                {
                    numerator += Sign(ranked[i].Rank1 - ranked[j].Rank1) 
                               * Sign(ranked[i].Rank2 - ranked[j].Rank2);
                }
            return numerator / denominator;
        }

        /// <summary>
        /// Compute the Tau-b correlation, which accounts for ties.
        /// 
        ///             n  - n
        ///              c    d
        ///  τ  = -----------------------
        ///   b    _____________________
        ///       / (n  -  n )(n  -  n )
        ///      √    0     1   0     2
        /// 
        /// where:
        ///        n0 = n(n-1)/2
        ///               
        ///        n1 =  Σ  t (t - 1)/2
        ///              i   i  i
        /// 
        ///        n2 =  Σ  t (t - 1)/2
        ///              j   j  j
        /// 
        ///      t[i] = # of ties for the ith group according to measure 1.
        ///      t[j] = # of ties for the jth group according to measure 2.
        ///        nc = # of concordant pairs
        ///        nd = # of discordant pairs
        /// </summary>
        /// <returns>A correlation value between -1 (perfect reverse correlation)
        ///  and +1 (perfect correlation). 
        /// Zero means uncorrelated. </returns>
        /// <param name="data">Data.</param>
        public double TauB(IEnumerable<T> data)
        {
            // Compute two Ranks by sorting first by Measure1 and then by Measure2.
            // Group by like values of each in order to handle ties.
            var ranked = data.Select(item => new { M1 = Measure1(item), M2 = Measure2(item) })
                .GroupBy(measures => new { measures.M1 })
                .OrderBy(@group => @group.First().M1)
                .ThenBy(@group => @group.First().M2)
                .AsEnumerable()
                .Select((@group, groupIndex) => new
                {
                    Measure1Ranked = @group.Select((measure, index) => new { measure.M1, measure.M2 }),
                    Rank = groupIndex + 1
                })
                .SelectMany(v => v.Measure1Ranked, (s, i) => new
                {
                    i.M1,
                    i.M2,
                    DenseRank1 = s.Rank
                })
                .GroupBy(measures => new { measures.M2 })
                .OrderBy(@group => @group.First().M2)
                .ThenBy(@group => @group.First().M1)
                .AsEnumerable()
                .Select((@group, groupIndex) => new
                 {
                     Measure2Ranked = @group.Select((measure, index) => new { measure.M1, measure.M2, measure.DenseRank1 }),
                     Rank = groupIndex + 1
                 })
                .SelectMany(v => v.Measure2Ranked, (s, i) => new { i.M1, i.M2, i.DenseRank1, DenseRank2 = s.Rank })                          
                .ToArray();
            if (ranked.Length <= 1)
                return 0; // No data or one data point. Impossible to establish correlation.

            // Now that we have ranked the data, compute the correlation.
            var n = ranked.Length;
            var n0 = (long)n * (n - 1) / 2; // long: n * (n - 1) overflows int once n exceeds 46,341.
            var n1 = 0L;
            var n2 = 0L;
            var numerator = 0L; // Stores nc - nd as a single value, rather than computing them separately.
            for (var i = 1; i < n; i++)
                for (var j = 0; j < i; j++)
                {
                    var iRanked = ranked[i];
                    var jRanked = ranked[j];
                    numerator += Sign(iRanked.DenseRank1 - jRanked.DenseRank1)
                               * Sign(iRanked.DenseRank2 - jRanked.DenseRank2);
                    // Keep track of ties. Because we are running the indices in a triangle,
                    // we automatically get this for n1 and n2: ties * (ties - 1) / 2
                    if (iRanked.M1.CompareTo(jRanked.M1) == 0)
                        n1++;
                    if (iRanked.M2.CompareTo(jRanked.M2) == 0)
                        n2++;
                }
            if (n0 == n1 || n0 == n2)
                return 0; // All ties, so everything has the same rank.
            // Observe that if n1 = n2 = 0, this formula is identical to Tau-a.
            return numerator / Sqrt((double)(n0 - n1)*(n0 - n2));
        }
    }
}

Here are the unit tests in NUnit:

using System;
using NUnit.Framework;
using static System.Math; // New C# 6.0 feature that allows one to import static methods and call them without their class name.

namespace Statistics
{

    [TestFixture]
    public class KendallTauCorrelationTests
    {
        public static int[] OneToTen = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

        #region Tau-a

        [Test]
        public void TauA_SameOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * 10
            );
            Assert.AreEqual(
                1.0,
                kendall.TauA(OneToTen),
                "Numbers that sort in the same order should be perfectly correlated."
            );
        }

        [Test]
        public void TauA_ReverseOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * -10
            );
            Assert.AreEqual(
                -1.0,
                kendall.TauA(OneToTen),
                "Numbers that sort in reverse order should be perfectly anti-correlated."
            );
        }

        [Test]
        public void TauA_OneSwap()
        {
            var reordered = new[] { 1, 2, 3, 5, 4, 6, 7, 8, 9, 10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value - 1]
            );
            Assert.AreEqual(
                43.0 / 45.0,
                kendall.TauA(OneToTen),
                0.00001,
                "If a single number is out of place the sequences should be almost perfectly correlated."
            );
        }

        #endregion

        #region Tau-b

        [Test]
        public void TauB_SameOrder()
        {
            var kendall = new KendallTauCorrelation<int,int>(
                (int value) => value, 
                (int value) => value * 10
            );
            Assert.AreEqual(
                1.0, 
                kendall.TauB(OneToTen), 
                "Numbers that sort in the same order should be perfectly correlated."
            );
        }

        [Test]
        public void TauB_ReverseOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * -10
            );
            Assert.AreEqual(
                -1.0,
                kendall.TauB(OneToTen),
                "Numbers that sort in reverse order should be perfectly anti-correlated."
            );
        }

        [Test]
        public void TauB_OneSwap_NoTies()
        {
            var reordered = new[] { 1,2,3,5,4,6,7,8,9,10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value-1]
            );
            Assert.AreEqual(
                43.0/45.0,
                kendall.TauB(OneToTen),
                0.00001,
                "If a single number is out of place the sequences should be almost perfectly correlated."
            );
        }

        [Test]
        public void TauB_Ties()
        {
            var reordered = new[] { 1, 1, 1, 4, 5, 6, 7, 8, 9, 10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value - 1]
            );
            Assert.AreEqual(
                42.0 / Sqrt(42.0*45.0),
                kendall.TauB(OneToTen),
                0.00001,
                "Adding a few ties should be almost perfectly correlated."
            );
        }

        #endregion
    }
}

NOTE: This uses the exhaustive O(N^2) algorithm. I have heard of a more efficient O(N log N) approach that uses a modified merge sort, but I have not seen how it is done.
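
For readers curious about the merge sort idea, here is a minimal sketch, not taken from the answer above. It assumes there are no ties (so it covers Tau-a only) and takes the two measures as parallel integer lists; the class name `KendallTauFast` and its method names are hypothetical. The idea is to order the items by the first measure, read off the second measure in that order, and count inversions while merge sorting: each inversion is one discordant pair.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the class above. Assumes NO ties in x or y.
public static class KendallTauFast
{
    // x[i] and y[i] are the two measures of the i-th item.
    public static double TauA(IReadOnlyList<int> x, IReadOnlyList<int> y)
    {
        if (x.Count != y.Count)
            throw new ArgumentException("Both lists must have the same length.");
        int n = x.Count;
        if (n <= 1)
            return 0; // Same convention as TauB: too little data to correlate.

        // Order the items by x, then read off y in that order.
        int[] ys = Enumerable.Range(0, n).OrderBy(i => x[i]).Select(i => y[i]).ToArray();

        // Every inversion in ys is one discordant pair; all other pairs are concordant.
        long discordant = CountInversions(ys, new int[n], 0, n);
        double pairs = n * (n - 1.0) / 2.0;
        return (pairs - 2.0 * discordant) / pairs;
    }

    // Merge-sorts a[lo..hi) in place and returns the number of pairs i < j with a[i] > a[j].
    private static long CountInversions(int[] a, int[] buffer, int lo, int hi)
    {
        if (hi - lo < 2)
            return 0;
        int mid = lo + (hi - lo) / 2;
        long count = CountInversions(a, buffer, lo, mid) + CountInversions(a, buffer, mid, hi);

        int left = lo, right = mid, k = lo;
        while (left < mid && right < hi)
        {
            if (a[left] <= a[right])
            {
                buffer[k++] = a[left++];
            }
            else
            {
                // a[right] is smaller than every element still waiting in the left half.
                count += mid - left;
                buffer[k++] = a[right++];
            }
        }
        while (left < mid) buffer[k++] = a[left++];
        while (right < hi) buffer[k++] = a[right++];
        Array.Copy(buffer, lo, a, lo, hi - lo);
        return count;
    }
}
```

Handling ties (Tau-b) in O(N log N) needs extra bookkeeping for the tied groups, which this sketch deliberately leaves out.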

NOTE: This generic class assumes that both measures return the same data type. It would be an easy change to give the class two generic measure types. They only need to be IComparable; they do not need to be comparable to each other.
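
As a rough illustration of that change, the sketch below gives each measure its own type parameter. The class name `KendallTauCorrelation2` is hypothetical, and only a simplified Tau-a is included to keep it short; it compares measure values directly instead of ranking first, which gives the same sign products when there are no ties.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical variant: C1 and C2 are independent comparable types.
public class KendallTauCorrelation2<T, C1, C2>
    where C1 : IComparable<C1>
    where C2 : IComparable<C2>
{
    private Func<T, C1> Measure1 { get; }
    private Func<T, C2> Measure2 { get; }

    public KendallTauCorrelation2(Func<T, C1> measure1, Func<T, C2> measure2)
    {
        Measure1 = measure1;
        Measure2 = measure2;
    }

    // Tau-a over all pairs. C1 values are only compared with C1 values,
    // and C2 values only with C2 values, never with each other.
    public double TauA(IReadOnlyList<T> data)
    {
        int n = data.Count;
        if (n <= 1)
            return 0; // Too little data to correlate.
        C1[] m1 = data.Select(Measure1).ToArray();
        C2[] m2 = data.Select(Measure2).ToArray();
        long numerator = 0;
        for (int i = 1; i < n; i++)
            for (int j = 0; j < i; j++)
                numerator += Math.Sign(m1[i].CompareTo(m1[j])) * Math.Sign(m2[i].CompareTo(m2[j]));
        return numerator / (n * (n - 1.0) / 2.0);
    }
}
```

For example, `new KendallTauCorrelation2<int, int, double>(v => v, v => v * -0.5)` pairs an `int` measure with a `double` measure.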

If you just want to measure how different they are, but you don't care where the differences occur, you may use Kendall's correlation coefficient. It gives you a score from -1 (the lists are in reverse order) to +1 (the lists are in the same order).

It basically counts the pairs of elements that are in the same order in both lists, subtracts the pairs that are in opposite order, and divides by the total number of pairs:

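// Assumes "using System;". a[i] and b[i] are the two ranks of the same item.
int[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
int[] b = { 3, 4, 1, 8, 6, 7, 2, 5 };

// Add +1 for every concordant pair and -1 for every discordant pair.
double numer = 0;
for (int i = 0; i < a.Length - 1; i++)
  for (int j = i + 1; j < a.Length; j++)
    numer += Math.Sign(a[i] - a[j]) * Math.Sign(b[i] - b[j]);

// Divide by the total number of pairs, n(n-1)/2.
double tau = numer / (a.Length * (a.Length - 1) / 2.0);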
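
The snippet above pairs a[i] with b[i], which matches the question directly only when the optimal list is 1, 2, ..., n. For arbitrary IDs, one possible adaptation is to look up each ID's position in the optimal list first and correlate those positions. The sketch below assumes both lists hold the same distinct IDs; the names `OrderComparison` and `TauFromIdLists` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: Kendall tau-a between two orderings of the same distinct IDs.
public static class OrderComparison
{
    public static double TauFromIdLists(IReadOnlyList<int> optimal, IReadOnlyList<int> actual)
    {
        if (optimal.Count != actual.Count)
            throw new ArgumentException("Both lists must contain the same IDs.");
        int n = optimal.Count;
        if (n <= 1)
            return 0; // Assumed convention: too little data to correlate.

        // Position of each ID in the optimal order (Add throws on a duplicate ID).
        var position = new Dictionary<int, int>(n);
        for (int i = 0; i < n; i++)
            position.Add(optimal[i], i);

        // rank[k] = optimal position of the ID found at position k of 'actual'.
        // The indexer throws KeyNotFoundException if 'actual' holds an unknown ID.
        var rank = new int[n];
        for (int k = 0; k < n; k++)
            rank[k] = position[actual[k]];

        // For i < j: +1 if the two IDs keep their relative order, -1 if they are swapped.
        long numer = 0;
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++)
                numer += Math.Sign(rank[j] - rank[i]);

        return numer / (n * (n - 1.0) / 2.0);
    }
}
```

For instance, with the optimal order { 42, 7, 19, 3 } and the actual order { 7, 42, 19, 3 }, five of the six pairs keep their order and one is swapped, giving (5 - 1) / 6.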
