Comparing arrays of strings for similarity

I have hundreds of JSON strings available. Each contains an array of 15-20 words sorted by a predetermined weight. This weight, if it's worth noting, is the number of times each word is found in some chunk of text. What's the best way of finding similarity between arrays of words that are structured like this?

The first idea that came to mind was to create a numerical hash of all the words together and compare these values to determine similarity. I wasn't very successful with this, since the resulting hash values of very similar strings were not very close. After some research on string comparison algorithms, I come to Stack Overflow in hopes of receiving more guidance. Thanks in advance, and please let me know if you need more details of the problem.

Edit 1: Clarifying what I'm trying to do: I want to determine how similar two arrays are according to the words each of them contains. I would also like to take into consideration the weight each word carries in each array. For example:

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

In that example, array 4 and array 2 are more similar than array 2 and array 3 because, even though both pairs contain the same words, the weights are identical for both words in arrays 4 and 2. I hope that makes it a little bit easier to understand. Thanks in advance.

I think that what you want is "cosine similarity", and you might also want to look at vector space models. If you are coding in Java, you can use the open source S-Space package.

(added on 31 Oct) Each element of the vector is the count of one particular string. You just need to transform your arrays of strings into such vectors. In your example, you have three words: "hill", "head", "valley". If your vector is in that order, the vectors corresponding to the arrays would be

// array: #hill, #head, #valley
array1:  {5,     5,     0}
array2:  {0,     5,     7}
array3:  {0,     6,     5}
array4:  {0,     5,     7}
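
As a rough sketch of how cosine similarity could be computed directly from the question's arrays, the JavaScript below builds a word-to-count lookup for each array and divides the dot product by the product of the vector lengths. The helper names toCounts and cosineSimilarity are illustrative assumptions, not part of any library, and a word missing from an array is treated as a count of 0.

```javascript
// Hypothetical helpers for illustration; names are not from the original answer.

// Turn [{word, count}, ...] into a Map of word -> count.
function toCounts(arr) {
  var counts = new Map();
  arr.forEach(function (item) {
    counts.set(item.word, (counts.get(item.word) || 0) + item.count);
  });
  return counts;
}

// Cosine similarity = dot(a, b) / (|a| * |b|).
// A word that is absent from one array contributes 0 to the dot product.
function cosineSimilarity(arrA, arrB) {
  var a = toCounts(arrA), b = toCounts(arrB);
  var dot = 0, normA = 0, normB = 0;
  a.forEach(function (count, word) {
    normA += count * count;
    if (b.has(word)) dot += count * b.get(word);
  });
  b.forEach(function (count) { normB += count * count; });
  if (normA === 0 || normB === 0) return 0; // assumption: empty arrays score 0
  return dot / Math.sqrt(normA * normB);
}

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head","count":6},{"word":"valley","count":5}];
var array4 = [{"word":"valley","count":7},{"word":"head","count":5}];

console.log(cosineSimilarity(array2, array4)); // identical vectors
console.log(cosineSimilarity(array2, array3)); // same words, different weights
console.log(cosineSimilarity(array1, array2)); // only "head" in common
```

With the question's data, array2 and array4 score 1, array2 and array3 score roughly 0.967, and array1 and array2 score roughly 0.411, which matches the ordering the question asks for. Because the score is normalised, it does not grow just because an array has large counts.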

Given that each array has to be compared to every other array, you are looking at a serious amount of processing: on the order of n(n-1)/2 pairwise comparisons for n arrays, each of which scales with the number of "words" in the arrays being compared. You'll need to store the score for each comparison, then make some sense of it.

e.g.

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

// Comparison score is summed product of matching word counts
function compareThings() {

  var a, b, i = arguments.length,
      j, m, mLen, n, nLen;
  var word, score, result = [];

  if (i < 2) return;

  // For each array
  while (i--) {
    a = arguments[i];
    j = i;

    // Compare with every other array
    while (j--) {
      b = arguments[j];
      score = 0;

      // For each word in array
      for (m=0, mLen = b.length; m<mLen; m++) {
        word = b[m].word;

        // Compare with each word in other array
        for (n=0, nLen=a.length; n<nLen; n++) {

          // Add to score
          if (a[n].word == word) {
            score += a[n].count * b[m].count;
          }
        }
      }

      // Put score in result
      result.push(i + '-' + j + ':' + score);
    }
  }
  return result;
}

var results = compareThings(array1, array2, array3, array4);

alert('Raw results:\n' + results.join('\n'));
/*
Raw results:
3-2:65
3-1:74
3-0:25
2-1:65
2-0:30
1-0:25
*/

results.sort(function(a, b) {
  a = a.split(':')[1];
  b = b.split(':')[1];
  return b - a;
});

alert('Sorted results:\n' + results.join('\n'));
/*
Sorted results:
3-1:74
3-2:65
2-1:65
2-0:30
3-0:25
1-0:25
*/

So 3-1 (array4 and array2) has the highest score. Fortunately the comparison need only be one way: you don't have to compare a to b and b to a.
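
The score for one pair is computed above by checking every word of one array against every word of the other. A possible alternative, sketched here under the assumption that each word appears at most once per array, is to index one array in a Map first so that each word of the other array needs only a single lookup. The function name pairScore is a hypothetical helper, not part of the code above.

```javascript
// Hypothetical helper: summed product of matching word counts for one pair.
// Assumes each word appears at most once per array.
function pairScore(a, b) {
  // Index the first array by word.
  var counts = new Map();
  a.forEach(function (item) { counts.set(item.word, item.count); });

  // One lookup per word of the second array.
  var score = 0;
  b.forEach(function (item) {
    if (counts.has(item.word)) {
      score += counts.get(item.word) * item.count;
    }
  });
  return score;
}

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head","count":6},{"word":"valley","count":5}];
var array4 = [{"word":"valley","count":7},{"word":"head","count":5}];

console.log(pairScore(array4, array2)); // 74
console.log(pairScore(array4, array3)); // 65
console.log(pairScore(array2, array1)); // 25
```

With 15-20 words per array the difference is small, so this mainly matters if the arrays grow. The number of pairs to compare stays the same either way.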

Here is an attempt. The algorithm is not very smart (a difference > 20 is the same as not having the same words), but could be a useful start:

var wordArrays = [
    [{"word":"hill","count":5},{"word":"head","count":5}]
  , [{"word":"valley","count":7},{"word":"head","count":5}]
  , [{"word":"head", "count": 6}, {"word": "valley", "count": 5}]
  , [{"word": "valley", "count": 7}, {"word":"head", "count": 5}]
]

function getSimilarTo(index){
    var src = wordArrays[index]
      , weighted

    if (!src) return null;

    // compare with other arrays
    weighted = wordArrays.map(function(arr, i){
        var diff = 0
        src.forEach(function(item){
            arr.forEach(function(other){
                if (other.word === item.word){
                    // add the absolute distance in count
                    diff += Math.abs(item.count - other.count)
                } else {
                    // mismatches
                    diff += 20
                }
            })
        })
        return {
            arr   : JSON.stringify(arr)
          , index : i
          , diff  : diff
        }
    })

    return weighted.sort(function(a,b){
        if (a.diff > b.diff) return 1
        if (a.diff < b.diff) return -1
        return 0
    })
}

/*
getSimilarTo(3)
[ { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 1,
    diff: 40 },
  { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 3,
    diff: 40 },
  { arr: '[{"word":"head","count":6},{"word":"valley","count":5}]',
    index: 2,
    diff: 43 },
  { arr: '[{"word":"hill","count":5},{"word":"head","count":5}]',
    index: 0,
    diff: 60 } ]
*/
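
One side effect of charging every non-matching pair of words is that an array compared with an identical copy still receives a non-zero diff. A possible refinement, sketched below as an assumption rather than as the answer's intended behaviour, is to charge the penalty once per word that has no partner in the other array. The names diffScore and MISMATCH_PENALTY are hypothetical.

```javascript
// Hypothetical refinement: penalise each unmatched word once, not each pair.
var MISMATCH_PENALTY = 20; // same arbitrary value as the answer above

function diffScore(src, other) {
  // Index the other array by word (assumes one entry per word).
  var otherCounts = new Map();
  other.forEach(function (item) { otherCounts.set(item.word, item.count); });

  var matched = new Set();
  var diff = 0;

  src.forEach(function (item) {
    if (otherCounts.has(item.word)) {
      // Shared word: add the absolute distance in count.
      diff += Math.abs(item.count - otherCounts.get(item.word));
      matched.add(item.word);
    } else {
      // Word only present in src.
      diff += MISMATCH_PENALTY;
    }
  });

  // Words only present in other.
  other.forEach(function (item) {
    if (!matched.has(item.word)) diff += MISMATCH_PENALTY;
  });

  return diff;
}

var a1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var a2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var a3 = [{"word":"head","count":6},{"word":"valley","count":5}];
var a4 = [{"word":"valley","count":7},{"word":"head","count":5}];

console.log(diffScore(a4, a2)); // 0  (identical)
console.log(diffScore(a4, a3)); // 3  (same words, counts differ by 2 and 1)
console.log(diffScore(a4, a1)); // 40 ("valley" and "hill" are unmatched)
```

Lower values mean more similar arrays, and 0 now means the two arrays are identical.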

Sort the arrays by word before attempting comparison. Once this is complete, comparing two arrays will require exactly one pass through each array.

After sorting the arrays, here is a compare algorithm (pseudo-Java):


// Both arrays must already be sorted by word. Lower return values mean more similar.
int compare(array1, array2)
{
  returnValue = 0;
  array1Index = 0;
  array2Index = 0;

  while (array1Index < array1.length && array2Index < array2.length)
  {
    order = array1[array1Index].word.compareTo(array2[array2Index].word);

    if (order == 0) // words match.
    {
      returnValue += abs(array1[array1Index].count - array2[array2Index].count);
      ++array1Index;
      ++array2Index;
    }
    else if (order < 0) // array1 word sorts first, so it has no match in array2.
    {
      // 100 is just a number to give extra weight to unmatched words.
      returnValue += 100 + array1[array1Index].count;
      ++array1Index;
    }
    else // array2 word sorts first, so it has no match in array1.
    {
      returnValue += 100 + array2[array2Index].count;
      ++array2Index;
    }
  }

  // account for any remaining unmatched array1 words.
  while (array1Index < array1.length)
  {
    returnValue += 100 + array1[array1Index].count;
    ++array1Index;
  }

  // account for any remaining unmatched array2 words.
  while (array2Index < array2.length)
  {
    returnValue += 100 + array2[array2Index].count;
    ++array2Index;
  }

  return returnValue;
}
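
For readers working in JavaScript like the question, the sort-then-merge idea might look something like the sketch below. It assumes that whichever array holds the alphabetically smaller word is the one to advance when the words differ, and it sorts copies so the input arrays are left untouched. The names byWord, compareSorted, compareArrays and UNMATCHED_WEIGHT are illustrative, not from the answer.

```javascript
// Hypothetical JavaScript sketch of the sort-then-merge comparison.
var UNMATCHED_WEIGHT = 100; // arbitrary extra weight for unmatched words

// Comparator that orders {word, count} entries alphabetically by word.
function byWord(x, y) {
  return x.word < y.word ? -1 : x.word > y.word ? 1 : 0;
}

// Single pass over two arrays that are already sorted by word.
function compareSorted(a, b) {
  var i = 0, j = 0, total = 0;
  while (i < a.length && j < b.length) {
    if (a[i].word === b[j].word) {
      total += Math.abs(a[i].count - b[j].count);
      i++;
      j++;
    } else if (a[i].word < b[j].word) {
      total += UNMATCHED_WEIGHT + a[i].count; // a[i] has no partner in b
      i++;
    } else {
      total += UNMATCHED_WEIGHT + b[j].count; // b[j] has no partner in a
      j++;
    }
  }
  // Whatever remains in either array is unmatched.
  for (; i < a.length; i++) total += UNMATCHED_WEIGHT + a[i].count;
  for (; j < b.length; j++) total += UNMATCHED_WEIGHT + b[j].count;
  return total;
}

// Sort copies of the inputs, then merge.
function compareArrays(a, b) {
  return compareSorted(a.slice().sort(byWord), b.slice().sort(byWord));
}

var first  = [{"word":"hill","count":5},{"word":"head","count":5}];
var second = [{"word":"valley","count":7},{"word":"head","count":5}];
var third  = [{"word":"head","count":6},{"word":"valley","count":5}];
var fourth = [{"word":"valley","count":7},{"word":"head","count":5}];

console.log(compareArrays(second, fourth)); // 0
console.log(compareArrays(second, third));  // 3
console.log(compareArrays(first, second));  // 212
```

If the same array takes part in many comparisons, sorting it once up front and calling compareSorted directly avoids repeating the sort.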
