
Calculating the similarity between 2 sentences

I would like to calculate the similarity between 2 sentences, and I need a percentage value that says how well they match each other. Sentences like:

1. The red fox is moving on the hill.
2. The black fox is moving in the bill.

I was considering Levenshtein distance, but I am not sure about it because it is described as a way to find the similarity between "2 words". Can Levenshtein distance help me here, or what other method would? I will be using JavaScript.
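For reference, Levenshtein distance is defined on any two strings, so it can be applied to whole sentences as well as single words. The sketch below is one way to turn it into a percentage; the function names `levenshtein` and `levenshteinSimilarity` are made up for illustration, and dividing by the longer string's length is just one common normalization, not the only one.

```javascript
// Levenshtein edit distance between two strings, compared character by character.
function levenshtein(a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match when cost is 0)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity as a percentage: 100 means identical strings.
// Assumption: normalize by the length of the longer string.
function levenshteinSimilarity(a, b) {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 100; // two empty strings are treated as identical
  return (1 - levenshtein(a, b) / maxLen) * 100;
}

console.log(
  levenshteinSimilarity(
    "The red fox is moving on the hill.",
    "The black fox is moving in the bill."
  )
);
```

Because this compares characters, two sentences with the same words in a different order will score low; the word-based measures suggested in the answers behave differently in that case.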

Try this solution: JS string diff

Use the Jaccard index. You can find implementations in any language, including JavaScript (here is one, though I haven't tested it personally).
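As a rough sketch of the idea, the Jaccard index is the size of the intersection of two sets divided by the size of their union. The version below treats each sentence as a set of lowercase words; the helper names `wordSet` and `jaccardSimilarity` and the ASCII-only tokenizing regex are assumptions made for this example, not part of any particular library.

```javascript
// Split a sentence into a set of lowercase words, dropping punctuation.
// Assumption: words consist of ASCII letters, digits and apostrophes.
function wordSet(sentence) {
  return new Set(sentence.toLowerCase().match(/[a-z0-9']+/g) || []);
}

// Jaccard index of the two word sets, as a percentage.
function jaccardSimilarity(a, b) {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 100; // nothing differs
  let shared = 0;
  for (const word of setA) {
    if (setB.has(word)) shared++;
  }
  const union = setA.size + setB.size - shared;
  return (100 * shared) / union;
}

console.log(
  jaccardSimilarity(
    "The red fox is moving on the hill.",
    "The black fox is moving in the bill."
  )
); // 40
```

For the two sentences in the question, 4 words are shared ("the", "fox", "is", "moving") out of 10 distinct words overall, which gives 40%.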

This is what I would do, depending on how important this is. If it is medium to low priority, here is a simple algorithm.

  1. Scan all sentences and see how often each word occurs.
  2. Filter out the most common words, such as those that appear in 30% of the sentences, i.e. don't count them. That way words like "at", "the" and "as" would hopefully not be counted.
  3. Then do your bag-of-words comparison.
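The three steps above could be sketched roughly as follows. All function names here are hypothetical, the 30% threshold is simply the figure mentioned in step 2, and the final comparison (shared word counts divided by total word counts) is just one possible reading of "bag of words comparison".

```javascript
// Assumption: words consist of ASCII letters, digits and apostrophes.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Step 1: count in how many sentences each word occurs.
function documentFrequency(sentences) {
  const df = new Map();
  for (const sentence of sentences) {
    for (const word of new Set(tokenize(sentence))) {
      df.set(word, (df.get(word) || 0) + 1);
    }
  }
  return df;
}

// Step 2: collect the words that occur in more than `threshold` of all sentences.
function commonWords(sentences, threshold = 0.3) {
  const common = new Set();
  for (const [word, count] of documentFrequency(sentences)) {
    if (count / sentences.length > threshold) common.add(word);
  }
  return common;
}

// Build a bag (word -> count) while skipping the ignored words.
function bagOfWords(sentence, ignored) {
  const bag = new Map();
  for (const word of tokenize(sentence)) {
    if (!ignored.has(word)) bag.set(word, (bag.get(word) || 0) + 1);
  }
  return bag;
}

// Step 3: compare two bags; returns a percentage.
function bagSimilarity(a, b, ignored) {
  const bagA = bagOfWords(a, ignored);
  const bagB = bagOfWords(b, ignored);
  let shared = 0;
  let total = 0;
  for (const word of new Set([...bagA.keys(), ...bagB.keys()])) {
    const countA = bagA.get(word) || 0;
    const countB = bagB.get(word) || 0;
    shared += Math.min(countA, countB);
    total += Math.max(countA, countB);
  }
  return total === 0 ? 0 : (100 * shared) / total; // 0 when no informative words remain
}

const corpus = ["the cat sat", "the dog ran", "the bird flew", "a fish swam"];
const ignored = commonWords(corpus); // contains only "the" for this corpus
console.log(bagSimilarity("the cat sat", "the cat ran", ignored));
```

With a very small corpus almost every word crosses the 30% line, so the threshold only becomes meaningful once there are enough sentences to scan.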

But the context of why you want to do this is really important. For instance, the example you gave could be for students learning English. I would use different algorithms if I were trying to see whether crowdsourced users are describing the same paragraph than if I were checking whether article topics are similar enough for a suggested-reading section.

A common method for computing the similarity of two sentences is cosine similarity. I don't know whether a JavaScript implementation exists. Cosine similarity looks at words rather than single letters. The web is full of explanations, for example here.
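A minimal sketch of cosine similarity in plain JavaScript, assuming each sentence is represented as a vector of raw word counts: the score is the dot product of the two vectors divided by the product of their lengths. The names `wordCounts` and `cosineSimilarity` are invented for this example, and real uses often weight the counts (for instance with TF-IDF) rather than using them raw.

```javascript
// Count how often each lowercase word occurs in a sentence.
// Assumption: words consist of ASCII letters, digits and apostrophes.
function wordCounts(sentence) {
  const counts = new Map();
  for (const word of sentence.toLowerCase().match(/[a-z0-9']+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Euclidean length of a word-count vector.
function vectorLength(counts) {
  let sumOfSquares = 0;
  for (const count of counts.values()) sumOfSquares += count * count;
  return Math.sqrt(sumOfSquares);
}

// Cosine of the angle between the two word-count vectors, as a percentage.
function cosineSimilarity(a, b) {
  const countsA = wordCounts(a);
  const countsB = wordCounts(b);
  let dot = 0;
  for (const [word, count] of countsA) {
    dot += count * (countsB.get(word) || 0);
  }
  const denominator = vectorLength(countsA) * vectorLength(countsB);
  return denominator === 0 ? 0 : (dot / denominator) * 100;
}

console.log(
  cosineSimilarity(
    "The red fox is moving on the hill.",
    "The black fox is moving in the bill."
  )
);
```

For the two sentences in the question this comes out at roughly 70%: the dot product is 7 ("the" counts twice in both sentences) and each vector has length √10.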
