
What NLP tools to use to match phrases having similar meaning or semantics

I am working on a project which requires me to match a phrase or keyword with a set of similar keywords. I need to perform semantic analysis to do this.

An example:

Relevant QT
cheap health insurance
affordable health insurance
low cost medical insurance
health plan for less
inexpensive health coverage

Common Meaning

low cost health insurance

Here the phrase under the Common Meaning column should match the phrases under the Relevant QT column. I looked at a bunch of tools and techniques to do this. S-Match seemed very promising, but I have to work in Python, not in Java. Latent Semantic Analysis also looks good, but I think it is more for document classification based upon a keyword rather than keyword matching. I am somewhat familiar with NLTK. Could someone provide some insight on what direction I should proceed in and what tools I should use?

If you have a big corpus available in which these words occur, you can train a model to represent each word as a vector. For instance, you can use deep learning via word2vec's skip-gram and CBOW models, which are implemented in the gensim software package.
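A minimal training sketch with gensim might look like this (the toy corpus and hyperparameters are purely illustrative; note that gensim 4.x spells the parameters and lookups as below, while older releases wrote size and model['word'] as in the snippets further down):

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. In practice you would
# stream a large corpus, e.g. with gensim's LineSentence over a text file.
sentences = [
    ['cheap', 'health', 'insurance'],
    ['affordable', 'health', 'insurance'],
    ['low', 'cost', 'medical', 'insurance'],
    ['inexpensive', 'health', 'coverage'],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Cosine similarity between two word vectors (gensim 4.x API).
print(model.wv.similarity('cheap', 'inexpensive'))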

In the word2vec model, each word is represented by a vector; you can then measure the semantic similarity between two words by measuring the cosine of the vectors representing those words. Semantically similar words should have a high cosine similarity, for instance:

model.similarity('cheap', 'inexpensive')  # -> 0.8

(The value is made up, just for illustration.)

Also, from my experiments, summing the vectors of a relatively small number of words (i.e., up to 3 or 4 words) preserves the semantics, for instance:

vector1 = model['cheap']+model['health']+model['insurance']
vector2 = model['low']+model['cost']+model['medical']+model['insurance']

similarity(vector1, vector2)  # -> 0.7

(Again, just for illustration.)
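The similarity function above is shorthand rather than a library call; a minimal sketch of the cosine similarity it stands for, using numpy, could be:

import numpy as np

def similarity(v1, v2):
    # Cosine similarity: dot product divided by the product of the norms.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))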

You can use this semantic similarity between words as a measure to generate your clusters.
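As a sketch of that idea (assuming a trained model whose vocabulary covers the phrases, and scikit-learn; the distance threshold is illustrative), you could cluster phrase vectors with agglomerative clustering over cosine distances:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

phrases = ['cheap health insurance',
           'affordable health insurance',
           'low cost medical insurance',
           'inexpensive health coverage']

# Represent each phrase as the sum of its word vectors (gensim 4.x API).
vectors = np.array([np.sum([model.wv[w] for w in p.split()], axis=0)
                    for p in phrases])

# Group phrases whose cosine distance falls below the threshold.
clustering = AgglomerativeClustering(n_clusters=None, metric='cosine',
                                     linkage='average', distance_threshold=0.3)
print(clustering.fit_predict(vectors))  # same label = same cluster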

When Latent Semantic Analysis refers to a "document", it basically means any set of words longer than one. You can use it to compute the similarity between a document and another document, between a word and another word, or between a word and a document. So you could certainly use it for your chosen application.
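A minimal sketch of this with gensim's LsiModel (the phrases-as-documents and the number of topics are illustrative):

from gensim import corpora, models, similarities

# Treat each phrase as a "document" of tokens.
docs = [['cheap', 'health', 'insurance'],
        ['affordable', 'health', 'insurance'],
        ['low', 'cost', 'medical', 'insurance']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Build a 2-topic LSI space and an index over the documents.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

# Project a new phrase into the LSI space and rank the documents against it.
query = dictionary.doc2bow(['low', 'cost', 'health', 'insurance'])
print(list(index[lsi[query]]))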

Other algorithms that may be useful include:

I'd start by taking a look at WordNet. It will give you real synonyms and other word relations for hundreds of thousands of terms. Since you tagged nltk: it provides bindings for WordNet, and you can use it as the basis for domain-specific solutions.
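A quick sketch of pulling synonyms out of WordNet through NLTK (assuming the wordnet corpus has been downloaded):

import nltk
nltk.download('wordnet')  # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

# Collect the lemma names of every synset of 'cheap' as candidate synonyms.
synonyms = {lemma.name() for synset in wn.synsets('cheap')
            for lemma in synset.lemmas()}
print(synonyms)  # includes, e.g., 'inexpensive'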

Still in NLTK, check out the discussion of the method similar() in the introduction to the NLTK book, and the class nltk.text.ContextIndex that it's based on. (All pretty simple still, but it might be all you really need.)
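A sketch of similar() in action, which ranks words appearing in the same contexts within a corpus (here NLTK's bundled Brown corpus, chosen just for illustration):

import nltk
nltk.download('brown')  # one-time download of the Brown corpus
from nltk.corpus import brown
from nltk.text import Text

text = Text(brown.words())
# Print words that occur in contexts similar to those of 'cheap'.
text.similar('cheap')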
