
How to remove noise in DBSCAN clustering for text data in Python and Sklearn?

Suppose my text data is a list of strings, as shown below.

l = ['have approved 13 request its showing queue note data been sync move out these request from queue', 'note have approved 12 requests its showing queue note data been sync move out all request from queue', 'have approved 2 request its showing queue note data been sync move out of these 2 request ch 30420 cr 13861']

I am using TfidfVectorizer and DBSCAN to cluster this text and assign each item a label.

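from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Word 3-grams and 4-grams; min_df=1 and max_df=1.0 keep every n-gram
vect = TfidfVectorizer(ngram_range=(3,4), min_df=1, max_df=1.0, decode_error="ignore")
tfidf = vect.fit_transform(l)
# Pairwise similarity matrix (rows are L2-normalised, so this is cosine similarity)
a = (tfidf @ tfidf.T).toarray()
db_a = DBSCAN(eps=0.3, min_samples=5).fit(a)
lab = db_a.labels_
print(lab)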

I get the output as

  `array([-1, -1, -1])`

So DBSCAN labels all of my data as -1, which, according to the sklearn DBSCAN documentation, means it is treated as noise.

If you have only 3 items but require a minPts of 5 items for a region to count as dense, then all of your data is noise by definition: no point has 5 neighbors within its eps radius.
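To make the counting argument concrete, here is a minimal sketch in plain Python. The three 2-D points and the helper `neighbor_counts` are made up for illustration; the only thing carried over from the question is `eps=0.3` and `min_samples=5`. It assumes the sklearn convention that a point counts as its own neighbor.

```python
import math

def neighbor_counts(points, eps):
    """For each point, count the points within eps of it (the point itself included)."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]

# Three made-up, tightly packed points (hypothetical data, not the question's text)
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
min_samples = 5

counts = neighbor_counts(points, eps=0.3)
# A point is a core point only if its neighborhood holds at least min_samples points
core_flags = [c >= min_samples for c in counts]

print(counts)      # no count can exceed len(points), which is 3
print(core_flags)  # so no point can be a core point
```

With no core points there is nothing for a cluster to grow from, so every point ends up labelled -1 regardless of how eps is chosen.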

Use much more data if you want density-based clusters. (I do not recommend reducing minPts below 5; it should usually be chosen larger to produce meaningful results. If you reduce minPts too much, you just get single-link clustering with all its drawbacks.)

Also note that you need to choose eps so that it captures similar documents. That is, documents that you consider very similar should have a distance below epsilon, and documents that you consider dissimilar must have a distance larger than epsilon.
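One way to get a feel for a sensible eps is to look at the pairwise distances directly. The sketch below is only an illustration: it assumes cosine distance on TF-IDF vectors with the vectorizer's default settings (word unigrams), which is not what the question used, and the variable names `docs` and `dist` are my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    'have approved 13 request its showing queue note data been sync move out these request from queue',
    'note have approved 12 requests its showing queue note data been sync move out all request from queue',
    'have approved 2 request its showing queue note data been sync move out of these 2 request ch 30420 cr 13861',
]

# Default TfidfVectorizer settings are an assumption made for this sketch
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine distance: 0 means same direction, 1 means no shared terms
dist = cosine_distances(tfidf)
print(dist.round(2))
```

Documents you consider near-duplicates should show up with small values here, and eps should sit between those values and the distances of pairs you consider unrelated. A matrix like `dist` can also be passed to DBSCAN directly by constructing it with `metric="precomputed"`.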

Although Erich Schubert's answer is the most holistic one, I want to add that you could also set:

minPts = 1 (min_samples=1 in sklearn)

to prevent any point from being labelled as noise, since each point that has no neighbours nearby becomes a cluster of its own. However, this will produce less meaningful results, as stated above.
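A small sketch of that effect, using three made-up 2-D points rather than the question's text (the array `X` and the name `n_noise` are hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: two close together, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])

# With min_samples=1 every point is a core point, because it counts itself
labels = DBSCAN(eps=0.3, min_samples=1).fit(X).labels_
n_noise = int((labels == -1).sum())

print(labels)   # the isolated point gets its own cluster label
print(n_noise)  # number of points labelled -1
```

The isolated point is not marked as noise; it simply forms a one-element cluster, which is why the result carries little information about density.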
