
Using word2vec to classify words in categories

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john', 'jay', 'dan', 'nathan', 'bob']  -> 'Names'
['yellow', 'red', 'green'] -> 'Colors'
['tokyo', 'bejing', 'washington', 'mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across Word2vec. This library has "similarity" and "most_similar" functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Calculate its similarity with each word in each vector and take an average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector, take an average, and then do that for the other two vectors as well. The vector that gives me the highest average similarity would be the correct vector for the input to belong to.
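As a rough sketch of this brute-force idea (assuming the gensim library and one of its downloadable pre-trained models; the predict helper is just an illustration, not an established API):

import gensim.downloader as api

# Downloads and caches the pre-trained vectors on first use
wv = api.load('glove-wiki-gigaword-100')

data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
}

def predict(query):
  averages = {}
  for category, words in data.items():
    known = [w for w in words if w in wv]  # skip out-of-vocabulary words
    averages[category] = sum(wv.similarity(query, w) for w in known) / len(known)
  # The category with the highest average similarity wins
  return max(averages, key=averages.get)

print(predict('purple'))   # expected: Colors
print(predict('calgary'))  # expected: Places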

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure if that is the best approach, and hence I am looking for help and suggestions on better approaches to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.

ANSWER

If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here's my solution:

import numpy as np

# Category -> words
data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query
def process(query):
  query_embed = embeddings_index[query]  # raises KeyError if the query word is out of vocabulary
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed)  # dot product as the similarity score
    dist /= len(data[category])    # averaged over the number of words in the category
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))
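Note that the score here is a raw dot product, which also rewards embeddings with large norms. If you prefer cosine similarity, a small variant (a sketch reusing the definitions above, not part of the original solution) would normalize the vectors first:

# Sketch: same scoring, but with L2-normalized vectors so the dot
# product equals cosine similarity (reuses embeddings_index,
# data_embeddings, categories and data from above).
def process_cosine(query):
  query_embed = embeddings_index[query]
  query_embed = query_embed / np.linalg.norm(query_embed)
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    sim = query_embed.dot(embed / np.linalg.norm(embed))
    scores[category] = scores.get(category, 0) + sim / len(data[category])
  return scores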

In order to run the solution, you'll have to download and unpack the pre-trained GloVe data from here (careful, 800Mb!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score, as sketched below. Remember that the model size only depends on the data you have and the words you might want to be able to query.
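Here is a hedged sketch of that filtering idea using scikit-learn's TfidfVectorizer (the corpus and the 0.3 threshold are placeholder assumptions; embeddings_index comes from the snippet above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
  'john flew from tokyo to washington',
  'the house was painted yellow and green',
]  # placeholder documents standing in for your real corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: docs x terms

# Highest tf-idf score each term reaches in any document
max_scores = tfidf.max(axis=0).toarray().ravel()

# Keep only the terms above an (arbitrary) threshold
keep = {w for w, i in vectorizer.vocabulary_.items() if max_scores[i] >= 0.3}
small_index = {w: v for w, v in embeddings_index.items() if w in keep}
print('Kept %s of %s word vectors.' % (len(small_index), len(embeddings_index)))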

Also, for what it's worth, PyTorch recently got a good and fast implementation of GloVe.
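For example, one way to access GloVe vectors from the PyTorch ecosystem is torchtext (a sketch; whether this is the implementation meant above is an assumption):

from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)  # downloads and caches the vectors on first use
pink = glove['pink']               # torch.Tensor of shape (100,)
print(pink.shape)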
