简体   繁体   中英

Should DBSCAN and its index have the same distance function

Is it required that DBSCAN and its index have the same distance function? If it is not, what are the cases when it is needed to use different distance functions?

Scala code how I create DBSCAN and index:

import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN
import de.lmu.ifi.dbs.elki.data.model.Model
import de.lmu.ifi.dbs.elki.data.{Clustering, DoubleVector, NumberVector}
import de.lmu.ifi.dbs.elki.database.{Database, StaticArrayDatabase}
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.NumberVectorDistanceFunction
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.index.tree.metrical.covertree.SimplifiedCoverTree

def createDatabase(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector]): Database = {
  val indexFactory = new SimplifiedCoverTree.Factory[NumberVector](distanceFunction, 1.3, 20)
  // Create a database
  val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), java.util.Arrays.asList(indexFactory))
  // Load the data into the database
  db.initialize()
  db
}

def dbscanClustering(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector]): Unit = {
  // Use the same `distanceFunction` for the database and DBSCAN <- is it required??
  val db = createDatabase(data, distanceFunction)
  val dbscan = new DBSCAN[DoubleVector](distanceFunction, 0.01, 20)
  val result: Clustering[Model] = dbscan.run(db)
  println(s"Number of clusters: ${result.getAllClusters.size()}")
  result.getAllClusters.asScala.zipWithIndex.foreach { case (cluster, idx) =>
    println(s"# $idx: ${cluster.getNameAutomatic}")
    println(s"Size: ${cluster.size()}")
    println(s"Model: ${cluster.getModel}")
}
val inputData: Array[Array[Double]] = ???
dbscanClustering(inputData, SquaredEuclideanDistanceFunction)

The index can only be used for acceleration if it uses the same distance function. Some indexes can support multiple (but not arbitrary) distances, for example the R*-tree can support all spatial distance functions (to a varying success though).

Obviously if you build an index to accelerate Cosine distance, but you ask for Euclidean nearest neighbors, the index cannot and will not be used.

You do not need to use an index, but without your runtime will be O(n²); with an index it can be much faster (depending on parameters, dimensionality, etc. - in the worst case, the index is overhead).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM