
List all sentences containing specific word and similar words with word2vec

I have a table as follows:

import pandas as pd

data = {'text': ['The scent is nice', 'I like the smell', 'The smell is awesome', 'I find the scent amazing', 'I love the smell']}

df = pd.DataFrame(data, columns=['text'])

I want to list all sentences that contain the word "smell":

word = 'smell'
selected_list = []
for i in range(0, len(df)):
    if word in df.iloc[i,0]:
        selected_list.append(df.iloc[i,0])
selected_list

The output that I get is:

['I like the smell', 'The smell is awesome', 'I love the smell']

However, I also want to list sentences that contain a word similar to "smell", such as "scent". I want to use Google's pre-trained word2vec vectors and set up a condition: if the similarity is above 0.5, list the sentence as well. Therefore, the desired output is:

['The scent is nice', 'I like the smell', 'The smell is awesome', 'I find the scent amazing', 'I love the smell']

How can I add word2vec to the above code so that it scans not only for "smell" but also for all similar words?

It sounds like you'll want to compare each word in a candidate text to your query word, and then check whether one (or more) of the most-similar words exceeds your threshold.

That will require tokenizing the raw texts into words suitable for lookup against your set of word-vectors, then comparing and sorting the results, and then checking them against your threshold.
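For example, a minimal tokenizer along these lines could be used (the helper simple_tokenize here is only an illustration, not part of gensim; in practice the preprocessing should match how the word vectors were trained, and note that the GoogleNews vectors are case-sensitive):

import re

def simple_tokenize(text):
    """Illustrative tokenizer: split on runs of non-letter characters and
    drop empty strings. Adjust casing/punctuation handling to match the
    preprocessing used when the word vectors were trained."""
    return [token for token in re.split(r'[^A-Za-z]+', text) if token]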

The core of what you need to do could be the following function, relying on the word-vectors support in the Python library gensim:

def rank_by_similarity(target_word, list_of_words, word_vectors):
    """Return a ranked list of (similarity_score, word) tuples for the words
    in list_of_words, by similarity to target_word, using the set of
    vectors in word_vectors (a gensim.models.KeyedVectors instance).
    Words not present in the vectors' vocabulary are skipped."""

    sim_pairs = [(word_vectors.similarity(target_word, word), word)
                 for word in list_of_words
                 if word in word_vectors]  # skip out-of-vocabulary words
    sim_pairs.sort(reverse=True)  # put largest similarities first
    return sim_pairs

With use in context:

from gensim.models import KeyedVectors

all_sentences = [
    'That looks nice',
    'The scent is nice',
    'It tastes great', 
    'I like the smell',
    'Wow that\'s hot',
]
query_word = 'smell'
threshold = 0.5

goog_vecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

selected_sentences = []
for sentence in all_sentences:
    tokens = sentence.split()
    ranked_tokens = rank_by_similarity(query_word, tokens, goog_vecs)
    if ranked_tokens and ranked_tokens[0][0] > threshold:  # if top match similarity > threshold...
        selected_sentences.append(sentence)  # ...consider it a match
print(selected_sentences)
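Applied back to the question's DataFrame, the same approach can filter df['text'] directly; a minimal sketch, assuming the GoogleNews vectors are already loaded into goog_vecs as above and reusing rank_by_similarity:

word = 'smell'
threshold = 0.5

selected_list = []
for sentence in df['text']:
    ranked = rank_by_similarity(word, sentence.split(), goog_vecs)
    if ranked and ranked[0][0] > threshold:  # top match above threshold
        selected_list.append(sentence)
print(selected_list)
# Should also include the 'scent' sentences, assuming
# similarity('smell', 'scent') in the GoogleNews vectors exceeds 0.5.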
