Why DBSCAN clustering returns single cluster on Movie lens data set?

The Scenario:

I'm performing clustering over the Movie Lens dataset, which I have in two formats:

OLD FORMAT:

uid iid rat
941 1   5
941 7   4
941 15  4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4

NEW FORMAT:

uid 1               2               3               4
1   5               3               4               3
2   4               3.6185548023    3.646073985     3.9238342172
3   2.8978348799    2.6692556753    2.7693015618    2.8973463681
4   4.3320762062    4.3407749532    4.3111995162    4.3411425423
940 3.7996234581    3.4979386925    3.5707888503    2
941 5               NaN             NaN             NaN
942 4.5762594612    4.2752554573    4.2522440019    4.3761477591
943 3.8252406362    5               3.3748860659    3.8487417604

over which I need to perform clustering using KMeans, DBSCAN and HDBSCAN. With KMeans I'm able to set the number of clusters and get them.

The Problem

The problem occurs only with DBSCAN and HDBSCAN: I'm unable to get a sufficient number of clusters (I do know the number of clusters cannot be set manually for these algorithms).

Techniques Tried:

  • Tried this with the IRIS data-set, where I found Species wasn't included. Clearly that column is a string and, besides, is the value to be predicted; everything works fine with that dataset (Snippet 1).
  • Tried the Movie Lens 100K dataset in the OLD FORMAT (with and without UID), drawing the analogy UID == Species and hence also trying without it (Snippet 2).
  • Tried the same with the NEW FORMAT (with and without UID), yet the results ended up the same way.

Snippet 1:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

print "\n\n FOR IRIS DATA-SET:"

iris = load_iris()
dbscan = DBSCAN()   # default parameters: eps=0.5, min_samples=5

d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)

Snippet 1 (Output):

FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]: 
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

Snippet 2:

import pandas as pd
from sklearn.cluster import DBSCAN

data_set = pd.DataFrame()

ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]   # drop the uid column
    data_set = data_set.dropna()
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]   # drop the uid column
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"

print "Starting with DBSCAN for Clustering on\n", data_set.info()

db_cluster = DBSCAN()   # default parameters: eps=0.5, min_samples=5
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

Snippet 2 (Output):

Extended Cluster Methods for:
1. Main Matrix IBCF 
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])

As seen, it returns only 1 cluster. I'd like to hear what I am doing wrong.

You need to choose appropriate parameters. With too small an epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen differently for each data set.
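
To see the effect, here is a minimal sketch (not part of the original answer) that sweeps a few arbitrary eps values over the data_set from Snippet 2 and reports how many clusters survive; with the smallest values everything is labelled -1 (noise):

import numpy as np
from sklearn.cluster import DBSCAN

# Sweep a few candidate eps values; the values themselves are only
# illustrative and must be tuned for the actual data set.
for eps in (0.5, 2.0, 5.0, 10.0):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(data_set)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("eps=%.1f -> %d clusters, %d noise points"
          % (eps, n_clusters, np.sum(labels == -1)))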

You also need to preprocess your data.
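
As one possible preprocessing step (an assumption on my part, not something this answer spells out), standardizing the columns keeps a single Euclidean eps meaningful across all features:

from sklearn.preprocessing import StandardScaler

# Standardize each column so no single movie dominates the distance;
# DBSCAN's eps then refers to distances on a comparable scale.
X_scaled = StandardScaler().fit_transform(data_set.dropna())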

It's trivial to get "clusters" with kmeans that are meaningless...

Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.

As pointed out by @faraway and @Anony-Mousse, the solution lies more in the mathematics of the dataset than in the programming.

I could finally figure out the clusters. Here's how:

import numpy as np

db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

# Count how many points ended up in each cluster label (-1 is noise)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d

The epsilon and outlier concepts became much clearer from this SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?
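
A sketch of the k-distance heuristic that question describes, assuming data_set is the preprocessed rating matrix: plot every point's distance to its k-th nearest neighbour (k = minPts), sorted in ascending order, and pick eps near the elbow of the curve:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # minPts; 4 is a common literature default
distances, _ = NearestNeighbors(n_neighbors=k).fit(data_set).kneighbors(data_set)

# Distance of every point to its k-th nearest neighbour, sorted;
# the "elbow" of this curve is a reasonable eps candidate.
k_dist = np.sort(distances[:, k - 1])
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to %d-th nearest neighbour" % k)
plt.show()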

First you need to preprocess your data, removing any useless attributes such as ids, and any incomplete instances (in case your chosen distance measure can't handle them).
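
A minimal sketch of that preprocessing, assuming the NEW FORMAT CSV from the question with a uid column and NaN entries (the file name is only illustrative):

import pandas as pd

ratings = pd.read_csv("MainMatrix_UBCF.csv")   # illustrative file name
X = ratings.drop(columns=["uid"])              # ids carry no distance information
X = X.dropna()                                 # drop incomplete users if the metric cannot handle NaN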

It helps to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).
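
A side-by-side sketch of the two parameterizations, using the preprocessed X from above (the values are placeholders, not recommendations):

from sklearn.cluster import KMeans, DBSCAN

# Centroid-based: the number of clusters is decided up front.
km_labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Density-based: instead, describe what a dense neighbourhood looks like.
db_labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(X)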

Normally in the literature the number of neighbors (minPts) is set to 4, and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to provide the number of neighbors (minPts).
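
A sketch using the hdbscan package (installed separately, e.g. via pip install hdbscan); min_cluster_size is its main knob and no eps is needed:

import hdbscan

# min_samples (roughly DBSCAN's minPts) defaults to min_cluster_size,
# and HDBSCAN* explores all density levels itself, so no eps is required.
labels = hdbscan.HDBSCAN(min_cluster_size=4).fit_predict(X)
print("HDBSCAN* clusters: %s" % set(labels))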

If after trying different configurations you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.

Have you tried seeing how the clusters look in 2D space using, for example, PCA? If the whole dataset is dense and actually forms a single group, then you will probably get a single cluster.
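
A sketch of that check, assuming data_set is the rating matrix and matplotlib is available:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional rating matrix to 2D and eyeball its shape;
# one dense blob here usually means DBSCAN will also find a single cluster.
coords = PCA(n_components=2).fit_transform(data_set)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("PCA projection of the rating matrix")
plt.show()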

Change other parameters like min_samples=5, algorithm and metric. Possible values of algorithm and metric can be checked from sklearn.neighbors.VALID_METRICS.
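
For example, to list which metrics each neighbor-search algorithm accepts:

from sklearn import neighbors

# VALID_METRICS maps each algorithm ('ball_tree', 'kd_tree', 'brute')
# to the distance metrics it supports.
for algo, metrics in neighbors.VALID_METRICS.items():
    print("%s -> %s" % (algo, sorted(metrics)))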
