How can I calculate similarity between two strings in C#?

I'm looking to assess the similarity (including case) between two strings and get a value between 0 and 1.

I tried a Levenshtein distance implementation, but it only returns integers and does not distinguish a case difference from a different letter.

For example, comparing "ABCD" with "Abcd" gives a distance of 3, and "AOOO" also gives a distance of 3, but clearly "Abcd" is a better match than "AOOO".

So, compared to "ABCD", I want "ABcd" to be the most similar, then "Abcd", then "AOOO", then "AOOOO".

I've also looked here, but I am not looking for a variable-length algorithm.

Thanks

Try something like this, which averages the case-sensitive and case-insensitive distances and normalizes by the longer length:

double d = (LevenshteinDist(s, t) + LevenshteinDist(s.ToLower(), t.ToLower())) /
           (2.0 * Math.Max(s.Length, t.Length));

If you want to give less importance to case differences than to letter differences, you can give different weights to the two terms:

double d = (0.15*LevenshteinDist(s, t) + 
            0.35*LevenshteinDist(s.ToLower(), t.ToLower())) /
           Math.Max(s.Length, t.Length);

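Note that the weights sum to 0.5, which makes the division by 2.0 unnecessary. With these weights d ranges from 0 (identical) to 0.5, because each Levenshtein distance is at most the length of the longer string; choose weights that sum to 1 (such as 0.3 and 0.7) if d should cover the full range from 0 to 1.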
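The snippets above call LevenshteinDist without defining it and produce a distance rather than a similarity. The following is a minimal sketch of how the pieces might fit together, not a definitive implementation: the class name StringSimilarity, the Similarity method, the default weights 0.3 and 0.7 (chosen to sum to 1 so the score spans 0 to 1), and the use of ToLowerInvariant in place of ToLower are all assumptions made for illustration.

```csharp
using System;

public static class StringSimilarity
{
    // Classic Levenshtein distance (insert, delete, substitute each cost 1),
    // computed with two rolling rows.
    public static int LevenshteinDist(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    // Hypothetical helper: returns 1.0 for identical strings and 0.0 for
    // strings with nothing in common. The weights are assumed to sum to 1.
    public static double Similarity(string s, string t,
                                    double caseWeight = 0.3,
                                    double letterWeight = 0.7)
    {
        int maxLen = Math.Max(s.Length, t.Length);
        if (maxLen == 0) return 1.0; // two empty strings are identical

        double d = (caseWeight * LevenshteinDist(s, t) +
                    letterWeight * LevenshteinDist(s.ToLowerInvariant(),
                                                   t.ToLowerInvariant())) / maxLen;
        return 1.0 - d;
    }
}
```

With these assumed weights, comparing "ABCD" against "ABcd", "Abcd", "AOOO" and "AOOOO" gives scores in the order the question asks for, because a case-only difference is charged just the caseWeight share of a full substitution.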

    // Requires: using System; using System.Collections.Generic;

    // Collects the distinct two-character substrings (bigrams) of a string.
    // Strings shorter than two characters yield an empty set.
    HashSet<string> bigrams(string s)
    {
        var result = new HashSet<string>();
        for (int i = 0; i < s.Length - 1; i++)
            result.Add(s.Substring(i, 2));
        return result;
    }

    // Returns the Dice coefficient of the two strings' distinct bigrams
    // as a percentage from 0 to 100 (divide by 100 for a value between 0 and 1).
    public double simi(string string1, string string2)
    {
        HashSet<string> set1 = bigrams(string1);
        HashSet<string> set2 = bigrams(string2);

        // Neither string has a bigram: avoid dividing by zero and
        // treat only identical strings as a full match.
        if (set1.Count + set2.Count == 0)
            return string1 == string2 ? 100.0 : 0.0;

        // Count the bigrams that occur in both strings.
        int shared = 0;
        foreach (string bigram in set1)
            if (set2.Contains(bigram))
                shared++;

        return 2.0 * shared / (set1.Count + set2.Count) * 100;
    }

Adapt the Levenshtein distance with a custom table T. Let the cost of an insertion be 1 and the cost of a deletion also be 1. Let T(c,d) denote the penalty for replacing c with d. T(c,c) should be 0, and T(c,d) should be <= 2.

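Define Max(n,m) to be the maximum theoretical distance between strings of lengths n and m. Obviously, Max(n,m) = n+m: deleting all n characters of one string and inserting all m characters of the other costs n+m, and no substitution costs more than a deletion plus an insertion.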

Define Distance(s,t) to be the cost of changing s to t divided by Max(|s|,|t|), where |s| and |t| are the lengths of the two strings. There you go.

Be careful in defining T so that the definition obeys the distance axioms:

  • Distance(s,s) = 0
  • Distance(s,t) = Distance(t,s)
  • Distance(s,t) <= Distance(s,u) + Distance(u,t)

Then it will be useful in more situations.
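The recipe above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name WeightedEditDistance and the concrete penalties in T (0.5 for a case-only difference, 2.0 for any other substitution) are hypothetical choices that satisfy T(c,c) = 0 and T(c,d) <= 2; the answer itself does not prescribe specific values.

```csharp
using System;

public static class WeightedEditDistance
{
    // Hypothetical substitution table T(c, d).
    // Assumed values: 0 for equal characters, 0.5 when the characters differ
    // only by case, 2.0 otherwise. T is symmetric by construction.
    public static double T(char c, char d)
    {
        if (c == d) return 0.0;
        if (char.ToUpperInvariant(c) == char.ToUpperInvariant(d)) return 0.5;
        return 2.0;
    }

    // Edit cost with insertion = 1, deletion = 1, substitution = T(c, d),
    // divided by Max(n, m) = n + m so the result lies between 0 and 1.
    public static double Distance(string s, string t)
    {
        int n = s.Length, m = t.Length;
        if (n + m == 0) return 0.0; // two empty strings are identical

        double[,] cost = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++) cost[i, 0] = i; // delete i characters
        for (int j = 0; j <= m; j++) cost[0, j] = j; // insert j characters

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                cost[i, j] = Math.Min(
                    Math.Min(cost[i - 1, j] + 1.0, cost[i, j - 1] + 1.0),
                    cost[i - 1, j - 1] + T(s[i - 1], t[j - 1]));
            }
        }
        return cost[n, m] / (n + m);
    }
}
```

A similarity score in the sense of the question would then be 1 - Distance(s, t). With the assumed table, "ABcd" is closer to "ABCD" than "Abcd", which is closer than "AOOO", which is closer than "AOOOO".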
