
How can I sort words into categories in Python?

I am working on a project that uses Google Cloud Vision to detect objects in images. The API returns a list of labels, so I end up with multiple words, and I would like to put each word into a category. For example:

Google Cloud Vision returns:

['Head', 'Lamp', 'Eye', 'Green', 'Arm', 'Piano', 'Mobile phone', 'Blue', 'Toy']

And I would like to have something like:

{'Object' : ['Lamp', 'Piano', 'Mobile phone', 'Toy'],
'Color' : ['Green', 'Blue'],
'Body parts': ['Head', 'Eye', 'Arm']
}

I know that word2vec has something called similarity, but that means I have to train a model. Is there a pretrained model I can use? Or is there another way to do this?

Check out WordNet, a free research lexicon that models word relationships and could thus help you group those labels in a variety of ways:

https://wordnet.princeton.edu/
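As a rough illustration of how hypernym ("is-a") links can drive the grouping, here is a minimal sketch. It assumes a tiny hand-written hierarchy in place of real WordNet data, so every entry in `HYPERNYMS` and `CATEGORY_ROOTS` is made up for this example. With NLTK installed, the real links should be reachable through `nltk.corpus.wordnet` (`synsets()` and `hypernyms()`), but that is not shown here.

```python
# Toy stand-in for WordNet's hypernym ("is-a") links. These entries are
# hand-written for illustration; real WordNet data is far larger.
HYPERNYMS = {
    "lamp": "furniture",
    "piano": "instrument",
    "mobile phone": "device",
    "toy": "artifact",
    "furniture": "artifact",
    "instrument": "artifact",
    "device": "artifact",
    "head": "body part",
    "eye": "body part",
    "arm": "body part",
    "green": "color",
    "blue": "color",
}

# Ancestors treated as final categories, mapped to the names in the question.
CATEGORY_ROOTS = {
    "artifact": "Object",
    "color": "Color",
    "body part": "Body parts",
}


def find_category(label):
    """Climb the hypernym chain until a category root is reached."""
    word = label.lower()
    seen = set()  # guards against cycles in the table
    while word is not None and word not in seen:
        if word in CATEGORY_ROOTS:
            return CATEGORY_ROOTS[word]
        seen.add(word)
        word = HYPERNYMS.get(word)
    return None  # no known category for this label


def group_labels(labels):
    """Group labels by category, keeping the original label order."""
    groups = {}
    for label in labels:
        category = find_category(label)
        if category is not None:
            groups.setdefault(category, []).append(label)
    return groups


labels = ['Head', 'Lamp', 'Eye', 'Green', 'Arm', 'Piano', 'Mobile phone', 'Blue', 'Toy']
print(group_labels(labels))
```

The point of climbing the chain is that a label does not need to be listed under its category directly: 'Lamp' reaches 'Object' through an intermediate node.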

You could also try using word vectors to model degrees of similarity between words, and thus potentially cluster related words. There are off-the-shelf sets of word vectors in various languages that you could use rather than training your own.

However, the similarity reflected by such sets may or may not be what you want for your purposes. For example, antonyms like 'hot' and 'cold' are typically very 'similar' in most word-vector models, as they concern the same aspect of something and are used in similar contexts. And logical hierarchies, such as one word being a more specific example of another, won't necessarily be clear in word-vector spaces. (WordNet, as a manually curated dataset, captures such hypernym/hyponym relationships explicitly.)
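To make the word-vector idea concrete, here is a minimal sketch that assigns each word to the category whose anchor word is closest by cosine similarity. The 3-dimensional vectors in `VECTORS` are invented for illustration. Real pretrained vectors have hundreds of dimensions and would be loaded from a file, and the caveats above (antonyms, hierarchies) would apply to them.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
VECTORS = {
    "lamp": [0.9, 0.1, 0.0],
    "piano": [0.8, 0.0, 0.2],
    "green": [0.1, 0.9, 0.1],
    "blue": [0.0, 0.8, 0.2],
    "head": [0.1, 0.1, 0.9],
    "arm": [0.2, 0.0, 0.8],
    # Anchor words that stand for each category (also an assumption).
    "object": [1.0, 0.0, 0.0],
    "color": [0.0, 1.0, 0.0],
    "body": [0.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_category(word, anchors):
    """Return the anchor word whose vector is most similar to the word's."""
    return max(
        anchors,
        key=lambda anchor: cosine_similarity(VECTORS[word], VECTORS[anchor]),
    )


anchors = ["object", "color", "body"]
for word in ["lamp", "piano", "green", "blue", "head", "arm"]:
    print(word, "->", nearest_category(word, anchors))
```

With real vectors, the choice of anchor words matters a great deal, so the results should be checked against a few known labels before being trusted.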

Try this; I hope it fits your criteria:

# The labels you want sorted
labels = ['Head', 'Lamp', 'Eye', 'Green', 'Arm', 'Piano', 'Mobile phone', 'Blue', 'Toy']
# All the body parts this program searches for
body_parts = ["Head", "Eye", "Arm"]
# All the objects it searches for
objects = ["Lamp", "Piano", "Mobile phone", "Toy"]
# All the colors it searches for
colors = ["Blue", "Green"]
# Result in the requested format; a category stays empty if nothing matches.
# (The names avoid shadowing the built-ins `dict` and `object`.)
result = {"Object": [], "Color": [], "Body": []}
for label in labels:
    if label in objects:
        category = "Object"
    elif label in colors:
        category = "Color"
    elif label in body_parts:
        category = "Body"
    else:
        continue  # unknown label: skip it
    # Avoid adding the same label twice
    if label not in result[category]:
        result[category].append(label)
print(result)

Test it and see whether it meets all your criteria.

[Image: proof that it works]
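If the category lists grow, one possible variation is to invert them once into a word-to-category lookup, so each label is resolved with a single dictionary access. The sketch below is an assumption-laden extension rather than part of the answer above: the `categorize` helper, the case-insensitive matching, and the `"Unknown"` fallback bucket are all choices made for this example.

```python
from collections import defaultdict

# Category table in the shape requested in the question.
CATEGORIES = {
    "Object": ["Lamp", "Piano", "Mobile phone", "Toy"],
    "Color": ["Green", "Blue"],
    "Body parts": ["Head", "Eye", "Arm"],
}

# Invert the table once: lowercase word -> category name.
WORD_TO_CATEGORY = {
    word.lower(): category
    for category, words in CATEGORIES.items()
    for word in words
}


def categorize(labels, fallback="Unknown"):
    """Group labels by category; unrecognized labels go to `fallback`."""
    groups = defaultdict(list)
    for label in labels:
        category = WORD_TO_CATEGORY.get(label.lower(), fallback)
        if label not in groups[category]:  # skip duplicate labels
            groups[category].append(label)
    return dict(groups)


print(categorize(['Head', 'Lamp', 'Eye', 'Green', 'Arm', 'Piano',
                  'Mobile phone', 'Blue', 'Toy', 'Cloud']))
```

Collecting unrecognized labels instead of dropping them makes it easier to see which words still need a category.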
