
Clustering of Tags

I have a dataset (~80k rows) that contains a comma-separated list of tags (skills), for example:

python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...

Some rows are as short as 1 skill; others can be as long as 50+. I would like to cluster groups of skills together (intuitively, people in the same cluster would have a very similar set of skills).

First, I use CountVectorizer from sklearn to vectorise the list of words and perform a dimensionality reduction using SVD, reducing it to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: groups of skills clustered together seem to be very unrelated.
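For reference, a minimal sketch of that pipeline might look like the following, assuming the rows are plain comma-separated strings (the `rows` list below is a hypothetical stand-in for the real 80k-row dataset):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Hypothetical stand-in for the real dataset of ~80k rows.
    rows = [
        "python, java, javascript",
        "marketing, communications, leadership",
        "web development, node.js, react",
    ]

    # Split on commas so multi-word skills like "web development" stay intact.
    vectorizer = CountVectorizer(
        tokenizer=lambda s: [t.strip() for t in s.split(",") if t.strip()],
        token_pattern=None,
    )
    X = vectorizer.fit_transform(rows)

    # 50 components as in the question; capped here so the toy data runs.
    svd = TruncatedSVD(n_components=min(50, X.shape[1] - 1), random_state=0)
    X_reduced = svd.fit_transform(X)

    # n=50 clusters as in the question; capped for the toy data.
    kmeans = KMeans(n_clusters=min(50, len(rows)), random_state=0, n_init=10)
    labels = kmeans.fit_predict(X_reduced)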

How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.

I would start with the following approaches:

  1. If you have enough data, try something like word2vec to get an embedding for each tag. You can use pre-trained models, but it is probably better to train on your own data since it has unique semantics. Make sure you have an OOV embedding for tags that don't appear enough times. Then use K-means, Agglomerative Hierarchical Clustering, or other known clustering methods (see the first sketch after this list).
  2. I would construct a weighted undirected graph, where each tag is a node, and edges represent the number of times 2 tags appeared in the same list. Once the graph is constructed, I would use a community detection algorithm for clustering. Networkx is a very nice library in python that lets you do that (see the second sketch after this list).
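A hedged sketch of the first approach, assuming gensim is available; `tag_lists` is a hypothetical stand-in for the parsed dataset, and the hyper-parameters (vector_size, window, epochs) are illustrative guesses, not tuned values:

    from gensim.models import Word2Vec
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical parsed dataset: one list of tags per row.
    tag_lists = [
        ["python", "java", "javascript"],
        ["python", "javascript", "react"],
        ["marketing", "communications", "leadership"],
    ]

    # Each row acts as a "sentence"; a wide window reflects that tag order
    # is meaningless and rows can hold 50+ tags. On real data, min_count
    # would be raised so rare tags fall back to an OOV bucket.
    model = Word2Vec(sentences=tag_lists, vector_size=50, window=50,
                     min_count=1, sg=1, epochs=20, seed=0)

    # Cluster the tag vectors with K-means (agglomerative would work too).
    tags = list(model.wv.index_to_key)
    vectors = np.array([model.wv[t] for t in tags])
    kmeans = KMeans(n_clusters=min(3, len(tags)), random_state=0, n_init=10)
    for tag, label in zip(tags, kmeans.fit_predict(vectors)):
        print(label, tag)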
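And a sketch of the second approach, building the weighted co-occurrence graph and running a community detection algorithm that ships with networkx (greedy modularity is used here; recent networkx versions offer others, such as Louvain):

    from itertools import combinations
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Same hypothetical parsed dataset as above.
    tag_lists = [
        ["python", "java", "javascript"],
        ["python", "javascript", "react"],
        ["marketing", "communications", "leadership"],
    ]

    G = nx.Graph()
    for tags in tag_lists:
        # Every pair of tags in the same row adds 1 to their edge weight.
        for a, b in combinations(sorted(set(tags)), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)

    # Each community is a set of tags that co-occur more among themselves
    # than with the rest of the graph.
    for i, community in enumerate(greedy_modularity_communities(G, weight="weight")):
        print(i, sorted(community))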

For any approach (including yours), don't give up before you do some hyper-parameter tuning. Maybe all you need is a smaller representation, or another K (for the KMeans).

Good luck!

All the TF-IDF, cosine, etc. approaches only work well for very long texts, where the vectors can be seen to model a term frequency distribution with reasonable numeric accuracy. For short texts, this is not reliable enough to produce useful clusters.

Furthermore, k-means needs to put every record into a cluster. But what about nonsense data, say someone whose only skill is "Klingon"?

Instead, use

Frequent Itemset Mining

This makes perfect sense on tags. It identifies groups of tags that occur frequently together. So one pattern is, e.g., "python, sklearn, numpy"; and the cluster is all the users that have these skills.

Note that these clusters will overlap, and some records may be in no cluster at all. That is of course harder to use, but for most applications it makes sense that records can belong to multiple, or no, clusters.
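A minimal sketch of frequent itemset mining on tag lists, assuming the mlxtend library (its apriori implementation; min_support here is an illustrative value to be tuned on real data):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    # Hypothetical parsed dataset: one "transaction" of tags per row.
    transactions = [
        ["python", "sklearn", "numpy"],
        ["python", "numpy", "pandas"],
        ["python", "sklearn", "pandas"],
        ["marketing", "communications"],
    ]

    # One-hot encode the transactions into a boolean DataFrame.
    encoder = TransactionEncoder()
    onehot = encoder.fit(transactions).transform(transactions)
    df = pd.DataFrame(onehot, columns=encoder.columns_)

    # Keep itemsets appearing in at least 50% of rows; each frequent
    # itemset defines an (overlapping) cluster: the users who have it.
    itemsets = apriori(df, min_support=0.5, use_colnames=True)
    print(itemsets)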
