How to remove noise in DBSCAN clustering for text data in Python and Sklearn?

Suppose my text data is a list of strings, as shown below.

l = ['have approved 13 request its showing queue note data been sync move out these request from queue', 'note have approved 12 requests its showing queue note data been sync move out all request from queue', 'have approved 2 request its showing queue note data been sync move out of these 2 request ch 30420 cr 13861']

I am using TfidfVectorizer and DBSCAN clustering to cluster this text and assign each document a label.

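from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(3, 4), min_df=1, max_df=1.0, decode_error="ignore")
tfidf = vect.fit_transform(l)
# Pairwise similarity matrix between the documents, converted to a dense array
a = (tfidf * tfidf.T).toarray()
db_a = DBSCAN(eps=0.3, min_samples=5).fit(a)
lab = db_a.labels_
print(lab)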

I get the following output:

  `array([-1, -1, -1])`

So DBSCAN labels all of my data as -1, which, according to the sklearn DBSCAN documentation, means every document is classified as noise.

If you have only 3 items but require a minPts (min_samples in sklearn) of 5 items for a region to be dense, then all your data is noise by definition: no point has 5 neighbors within its eps radius.
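A minimal sketch of this point, assuming scikit-learn is installed. The toy coordinates and the deliberately huge `eps` are made up for illustration and are not taken from the question. Note that sklearn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three made-up 2-D points standing in for the three documents.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])

# Even with an eps large enough to cover the whole data set, a neighbourhood
# can hold at most 3 points (the point itself included), so min_samples=5
# is never reached and every point is labelled noise (-1).
labels_5 = DBSCAN(eps=1e6, min_samples=5).fit(X).labels_

# Once min_samples does not exceed the number of points, the same data
# forms a single cluster.
labels_3 = DBSCAN(eps=1e6, min_samples=3).fit(X).labels_

print(labels_5)
print(labels_3)
```

So no choice of `eps` can rescue a 3-item data set when `min_samples` is 5.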

Use much more data if you want density-based clusters. (I do not recommend reducing minPts below 5; it should usually be chosen larger to produce meaningful results. If you reduce minPts too much, you just get single-link clustering with all its drawbacks.)

Also note that you need to choose eps so that it captures similar documents. That is, documents that you consider very similar should have a distance below epsilon, and documents that you consider dissimilar must have a distance larger than epsilon.
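One way to choose eps is to look at the pairwise distances first. The sketch below is an illustration under assumptions, not the asker's setup: it uses default unigram TF-IDF, cosine distance, a hypothetical `eps` of 0.5, and `min_samples=2` only because the toy data has three documents. It passes a distance matrix with `metric="precomputed"`, whereas the question's code feeds a similarity matrix to DBSCAN as if its rows were feature vectors.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "have approved 13 request its showing queue note data been sync "
    "move out these request from queue",
    "note have approved 12 requests its showing queue note data been sync "
    "move out all request from queue",
    "have approved 2 request its showing queue note data been sync "
    "move out of these 2 request ch 30420 cr 13861",
]

# Assumption: default unigram TF-IDF rather than the question's 3-4-grams.
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine distance = 1 - cosine similarity; 0 means the same direction.
dist = cosine_distances(tfidf)
print(dist.round(2))

# Pick eps above the distances between documents you consider similar and
# below the distances between documents you consider dissimilar.
eps = 0.5  # hypothetical value; read a sensible one off the matrix above

# min_samples=2 is only for this 3-document toy example.
labels = DBSCAN(eps=eps, min_samples=2, metric="precomputed").fit(dist).labels_
print(labels)
```

On a realistic corpus you would inspect the distribution of these distances (for example, the distance to each document's k-th nearest neighbour) rather than the full matrix.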

Although Erich Schubert's answer is the most holistic one, I want to add that you could also set:

minPts = 1

to prevent any noise from being created, since each point becomes its own cluster if it has no neighbours near it. However, as stated above, this will produce less meaningful results.
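A small sketch of that behaviour, assuming scikit-learn and using made-up 2-D points rather than the question's text. In sklearn the parameter is `min_samples`, and because it counts the point itself, every point is a core point when it is set to 1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: two close together, two isolated.
X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [10.0, 10.0]])

# With min_samples=1 no point can be noise; isolated points simply
# become singleton clusters.
labels = DBSCAN(eps=0.3, min_samples=1).fit(X).labels_

n_noise = int((labels == -1).sum())
n_clusters = len(set(labels.tolist()))
print(labels, n_noise, n_clusters)
```

Here the two nearby points share a cluster and each isolated point gets a cluster of its own, which is why the result tells you little about density.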
