Clustering of Tags
I have a dataset (~80k rows) in which each row contains a comma-separated list of tags (skills), for example:
python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...
Some rows have as few as 1 skill; others have 50 or more. I would like to cluster similar sets of skills together (intuitively, people in the same cluster would have very similar skill sets).
First, I use CountVectorizer from sklearn to vectorise the list of words, then perform a dimension reduction using SVD, reducing it to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: the groups of skills clustered together seem to be very unrelated.
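For reference, the pipeline described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the toy `rows`, the `split_tags` helper, the use of `TruncatedSVD` as the SVD step, and the small `n_components` / `n_clusters` values are all assumptions made so the example runs on six rows. One detail worth checking in a real pipeline is the tokenizer: the default `CountVectorizer` tokenization splits on word boundaries, so a tag like "web development" becomes two separate features unless the rows are split on commas.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the ~80k rows of comma-separated skills (assumed data).
rows = [
    "python, java, javascript,",
    "marketing, communications, leadership,",
    "web development, node.js, react",
    "python, javascript, react",
    "marketing, leadership",
    "java, node.js, web development",
]


def split_tags(row):
    """Split on commas so multi-word tags such as 'web development' stay whole."""
    return [tag.strip().lower() for tag in row.split(",") if tag.strip()]


# binary=True records presence/absence of each tag in a row.
vectorizer = CountVectorizer(
    tokenizer=split_tags, lowercase=False, token_pattern=None, binary=True
)
X = vectorizer.fit_transform(rows)  # sparse matrix, one column per distinct tag

# The question uses 50 components and 50 clusters; smaller values fit the toy data.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

for row, label in zip(rows, labels):
    print(label, split_tags(row))
```

Printing the tags next to each label makes it easy to eyeball whether the clusters are coherent before moving to the full dataset.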
How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.
I would start with the following approaches:
For any approach (including yours), don't give up before you have done some hyper-parameter tuning. Maybe all you need is a smaller representation, or a different K (for KMeans).
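One way to do that tuning is a small grid over the representation size and K, scored with a clustering quality measure. The sketch below assumes the silhouette score as that measure (the answer does not name one) and uses toy rows and tiny grids so it runs quickly; `split_tags` and the grid values are illustrative choices, not recommendations.

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Toy rows (assumed data); the real dataset has ~80k rows.
rows = [
    "python, java, javascript",
    "python, javascript, react",
    "java, node.js, web development",
    "web development, node.js, react",
    "marketing, communications, leadership",
    "marketing, leadership",
    "communications, leadership, marketing, sales",
    "sales, marketing",
]


def split_tags(row):
    """Split on commas so multi-word tags stay whole."""
    return [tag.strip().lower() for tag in row.split(",") if tag.strip()]


X = CountVectorizer(
    tokenizer=split_tags, lowercase=False, token_pattern=None, binary=True
).fit_transform(rows)

# Try every (n_components, k) pair in a small grid.
results = {}
for n_components, k in product([2, 3], [2, 3, 4]):
    reduced = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    # Scored in the reduced space, so compare scores mainly across k
    # for the same n_components rather than across different n_components.
    results[(n_components, k)] = silhouette_score(reduced, labels)

for (n_components, k), score in sorted(results.items()):
    print(f"n_components={n_components} k={k} silhouette={score:.3f}")
```

On the full dataset the silhouette computation is quadratic in the number of rows, so passing `sample_size` to `silhouette_score` is probably necessary. The score is only a rough guide; inspecting the tags in a few clusters remains the real test.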
Good luck!
TF-IDF, cosine similarity, and the like only work well for very long texts, where the vectors can be seen as modelling a term-frequency distribution with reasonable numeric accuracy. For short texts, this is not reliable enough to produce useful clusters.
Furthermore, k-means has to put every record into a cluster. But what about nonsense data, say someone whose only skill is "Klingon"?
Instead, use frequent itemset mining.
This makes perfect sense on tags. It identifies groups of tags that frequently occur together. So one pattern is, e.g., "python, sklearn, numpy", and the cluster is all the users that have these skills.
Note that these clusters will overlap, and some records may be in no cluster at all. That is of course harder to use, but for most applications it makes sense that records can belong to multiple clusters, or to none.
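Finding groups of tags that frequently occur together is what is usually called frequent itemset mining. The sketch below is a brute-force illustration of that idea in plain Python, assuming toy rows, a hypothetical `min_support` threshold of 2, and itemsets of at most 3 tags. It is not Apriori or FP-growth, and it would be far too slow for 80k rows with 50+ tags each; a dedicated implementation of one of those algorithms would be the practical choice there.

```python
from collections import Counter
from itertools import combinations

# Toy rows (assumed data), including one "nonsense" record.
rows = [
    "python, sklearn, numpy",
    "python, numpy, sklearn, pandas",
    "python, sklearn, numpy, java",
    "marketing, communications, leadership",
    "marketing, leadership",
    "klingon",
]


def split_tags(row):
    """Turn one comma-separated row into a set of normalised tags."""
    return frozenset(tag.strip().lower() for tag in row.split(",") if tag.strip())


def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every combination of 2..max_size tags per row (brute force)
    and keep the combinations found in at least min_support rows."""
    counts = Counter()
    for tags in transactions:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(tags), size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


transactions = [split_tags(row) for row in rows]
patterns = frequent_itemsets(transactions, min_support=2)

# The "cluster" for a pattern is every row that contains all of its tags.
clusters = {
    combo: [i for i, tags in enumerate(transactions) if set(combo) <= tags]
    for combo in patterns
}

# Rows that match no frequent pattern belong to no cluster.
unassigned = [
    i for i in range(len(rows))
    if not any(i in members for members in clusters.values())
]

for combo, members in sorted(clusters.items()):
    print(combo, "->", members)
print("no cluster:", unassigned)
```

In this toy data the first three rows fall into several overlapping clusters (one per frequent pair plus the full triple), and the "klingon" row ends up in none, which matches the behaviour described above.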