
List all sentences containing specific word and similar words with word2vec

I have a table as follows:

import pandas as pd

data = {'text': ['The scent is nice', 'I like the smell', 'The smell is awesome', 'I find the scent amazing', 'I love the smell']}

df = pd.DataFrame(data, columns=['text'])

I want to list all sentences that contain the word "smell":

word = 'smell'
selected_list = []
for i in range(0, len(df)):
    if word in df.iloc[i,0]:
        selected_list.append(df.iloc[i,0])
selected_list

The output that I get is:

['I like the smell', 'The smell is awesome', 'I love the smell']

However, I also want to list sentences that contain a word similar to "smell", such as "scent". I want to use Google's pre-trained word2vec vectors and set up a condition: if the similarity is above 0.5, list the sentence as well. Therefore, the desired output is:

['The scent is nice', 'I like the smell', 'The smell is awesome', 'I find the scent amazing', 'I love the smell']

How can I add word2vec to the above code so that it scans not only for "smell" but also for all similar words?

It sounds like you'll want to compare each word in a candidate text to your query word, and then check whether one (or more) of the most-similar words exceeds your threshold.

That will require tokenizing the raw texts into words suitable for lookup against your set of word-vectors, then comparing and sorting the results, and then checking them against your threshold.
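For example, a minimal tokenizer along these lines could be used (the helper simple_tokenize here is only an illustration, not part of gensim; in practice the preprocessing should match how the word vectors were trained, and note that the GoogleNews vectors are case-sensitive):

import re

def simple_tokenize(text):
    """Illustrative tokenizer: split on runs of non-letter characters and
    drop empty strings. Adjust casing/punctuation handling to match the
    preprocessing used when the word vectors were trained."""
    return [token for token in re.split(r'[^A-Za-z]+', text) if token]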

The core of what you need to do could be the following function, relying on the word-vectors support in the Python library gensim:

def rank_by_similarity(target_word, list_of_words, word_vectors):
    """Return a ranked list of (similarity_score, word) tuples for the words
    in list_of_words, by similarity to target_word, using the set of
    vectors in word_vectors (a gensim.models.KeyedVectors instance).
    Words not present in the vectors' vocabulary are skipped."""

    sim_pairs = [(word_vectors.similarity(target_word, word), word)
                 for word in list_of_words
                 if word in word_vectors]  # skip out-of-vocabulary words
    sim_pairs.sort(reverse=True)  # put largest similarities first
    return sim_pairs

With use in context:

from gensim.models import KeyedVectors

all_sentences = [
    'That looks nice',
    'The scent is nice',
    'It tastes great', 
    'I like the smell',
    'Wow that\'s hot',
]
query_word = 'smell'
threshold = 0.5

goog_vecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

selected_sentences = []
for sentence in all_sentences:
    tokens = sentence.split()
    ranked_tokens = rank_by_similarity(query_word, tokens, goog_vecs)
    if ranked_tokens and ranked_tokens[0][0] > threshold:  # if top match similarity > threshold...
        selected_sentences.append(sentence)  # ...consider it a match
print(selected_sentences)
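Applied back to the question's DataFrame, the same approach can filter df['text'] directly; a minimal sketch, assuming the GoogleNews vectors are already loaded into goog_vecs as above and reusing rank_by_similarity:

word = 'smell'
threshold = 0.5

selected_list = []
for sentence in df['text']:
    ranked = rank_by_similarity(word, sentence.split(), goog_vecs)
    if ranked and ranked[0][0] > threshold:  # top match above threshold
        selected_list.append(sentence)
print(selected_list)
# Should also include the 'scent' sentences, assuming
# similarity('smell', 'scent') in the GoogleNews vectors exceeds 0.5.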
