Search

link to homepage

Institute for Advanced Simulation (IAS)

Navigation and service


Clustering

HPDBSCAN

Research activities in clustering focus on a Highly Parallelizable Density-Based Spatial Clustering of Applications with Noise (HPDBSCAN) code that is available as opensource.

Description

HPDBSCAN data-preprocessing strategy is based onspatial cells with spatial indexing and redistribution according to the point localities as shown in Figure 1. It thus enables data density-based chunking of computations.

HPDBSCANFigure 1: HPDBSCAN Data Preprocessing into Spatial Cells
Copyright: FZJ/JSC

Computational optimizations to work with large quantities of datasets take advantage of caching of neighborhood searches. Further optimizations are cluster merging based on comparisons instead of zone reclustering.

Features

  • Parallel MPI/OpenMP C++ code implementation
  • HDF5 data format inputs/outputs
  • open-source and best-effort support for academic applications

Performance Evaluation

Figure 2 shows the scalability of the HPDBSCAN code with an increasing number ofdata points using different number of cores.

Evaluation of the  scalable HPDBSCAN implementationFigure 2: Evaluation of the scalable HPDBSCAN implementation
Copyright: FZJ/JSC

Related Work

Figure 3 shows performance comparisons with another parallel implementation of DBSCAN [REF1] with respect to computation time, speed-up, and memory use. The comparisons used 3.7056.351 data points in 2 dimensions.

cores12481632
Computation time [s]
JSC-HPDBSCAN117,1859,6430,6816,2510,869,39
NWU-PDSDBSCAN288,35162,47105,9489,8785,3788,42
Speed-up [x]
JSC-HPDBSCAN1,001,963,827,7110,7912,48
NWU-PDSDBSCAN1,001,772,723,213,383,36
Memory [MB]
JSC-HPDBSCAN251,064345,276433,340678,2481,101,0002,111,000
NWU-PDSDBSCAN500,512725,1041,370,0004,954,00019,724,00059,685,000
Nodes11248163264
Cores112244896192384768
Compute Time [hh:mm:ss]22:04:582:11:551:10:580:36:240:18:510:09:560:05:450:03:22
Speed-up [x]1.010.018.736.470.3133.3230.3392.7

Performance Comparisons of HPDBSCAN with NWU-PDSDBSCANFigure 3: Performance Comparisons show HPDBSCAN with low computation time, high speed-up, and high memory efficiency compared to another parallel implementation (NWU-PDSDBSCAN).
Copyright: FZJ/JSC

Background

Clustering is an unsupervised learning technique with a broad class of methods for discovering previously unknown subgroups in data as shown in Figure 4. The technique is often used to discover interesting facts & relationships about input datasets during 'exploratory data analysis'.

Partitioning of Data in SubgroupsFigure 4: Partitioning of data in subgroups (i.e. ?clusters?) previously unknown
Copyright: FZJ/JSC

DBSCAN is one established method introduced 1996 by Martin Ester et al. [REF2] that groups a number of similar points into clusters with the following properties:

  • similarity is defined by a distance measure (euclidean distance)
  • looks for a similar point within a given search radius (epsilon)
  • Clusters consist of a given minimum number of points (minPoints)

The following distinct features of DBSCAN make it a technique of choice for many application domains:

  • clusters a variable number of Clusters
  • forms abritrary shaped clusters
  • efficiently identifies outliers/noise

[REF1] Patwary, Md Mostofa Ali, et al. "A new scalable parallel dbscanalgorithm using the disjoint-set data structure." High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012.

[REF2] Ester, Martin, et al. "A density-based algorithm for discoveringclusters in large spatial databases with noise." Kdd. Vol. 96. 1996.


Servicemeu

Homepage