Anytime Cell-based DBSCAN Algorithm that Connects Randomly Selected Cells and Its Performance Evaluation
Data clustering is an unsupervised learning method that finds groups of similar features in a dataset without defining class labels. With the growing interest in big data, there is a need for techniques to speed up clustering algorithms. One clustering method that is well-known in the database domain is density-based spatial clustering of applications with noise (DBSCAN). Since the DBSCAN algorithm was first proposed, several speed-up methods have been introduced, such as the cell-based DBSCAN algorithm. The cell-based DBSCAN algorithm divides the whole dataset into smaller cells and connects them to form clusters. In this paper we propose a novel clustering algorithm called the anytime cell-based DBSCAN algorithm. The proposed algorithm connects some randomly selected cells and calculates the clustering result at high speed. The process is then repeated to improve the clustering accuracy and obtain accurate results. In this paper, we report experimental results on synthetic and real datasets showing that the proposed algorithm can calculate clustering results with high accuracy at high speed.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. KDD 1996, 1996, pp. 226–231.
J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, “Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169–194, 1998.
A. Gunawan, “A faster algorithm for DBSCAN,” Master’s thesis, Technische University Eindhoven, 40 pages, 2013.
J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, and approximation,” in Proc. SIGMOD 2015, 2015, pp. 519–530.
——, “On the hardness and approximation of euclidean dbscan,” ACM Trans. Database Syst., vol. 42, no. 3, pp. 14:1–14:45, 2017.
S. T. Mai, I. Assent, and M. Storgaard, “AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets,” in Proc. KDD 2016, 2016, pp. 1025–1034.
S. Mai Thai, I. Assent, J. Jacobsen, and M. Dieu, “Anytime parallel density-based clustering,” Data Mining and Knowledge Discovery, vol. 32, pp. 1121–1176, 04 2018.
T. Sakai, K. Tamura, and H. Kitakami, “Cell-based DBSCAN algorithm using minimum bounding rectangle criteria,” in Proc. BDMS 2017, 2017, pp. 133–144.
T. Sakai, K. Tamura, H. Kitakami, and T. Takezawa, “Speed-up of cell based DBSCAN using minimum bounding rectangle and recursive cell partitioning,” The IEICE transactions on information and systems, vol. J101-D, no. 4, 2018.
S. Zilberstein, “Using anytime algorithms in intelligent systems,” vol. 17, pp. 73–83, 1996.
Y. El-Sonbaty, M. A. Ismail, and M. Farouk, “An efficient density based clustering algorithm for large databases,” in Proc. ICTAI 2004, 2004, pp. 673–677.
S. Mahran and K. Mahar, “Using grid for accelerating density-based clustering,” in Proc. CIT 2008, 2008, pp. 35–40.
S. Zhou, A. Zhou, J. Cao, J. Wen, Y. Fan, and Y. Hu, “Combining sampling technique with DBSCAN algorithm for clustering large spatial databases,” in Proc. PAKDD 2000, 2000, pp. 169–172.
M. Dash, H. Liu, and X. Xu, “’1+1>2’: merging distance and density based clustering,” in Proc. DASFAA 2001, 2001, pp. 32–39.
X. Wang and H. J. Hamilton, “DBRS: A density-based spatial clustering method with random sampling,” in Proc. PAKDD 2003, 2003, pp. 563–575.
B. Borah and D. K. Bhattacharyya, “An improved sampling-based DBSCAN for large spatial databases,” in Proc. ICISIP 2004, 2004, pp. 92–96.
B. Liu, “A fast density-based clustering algorithm for large databases,” in Proc. ICMLC 2006, 2006, pp. 996–1000.
C.-F. Tsai and C.-T. Wu, “GF-DBSCAN: A new efficient and effective data clustering technique for large databases,” in Proc. MUSP 2009, 2009, pp. 231–236.
Z. Wang, R. Zhang, J. Qi, and B. Yuan, “Dbsvec: Density-based clustering using support vector expansion,” in Proc.of ICDE 2019, 2019, pp. 280–291.
N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Is a correction for chance necessary?” in Proc. ICML 2009, 2009, pp. 1073–1080.
A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Proc. ISWC 2012, 2012, pp. 108–109.
M. Varma and A. Zisserman, “Texture classification: are filter banks necessary?” in Proc. CVPR 2003, vol. 2, 2003, pp. II–691–8 vol.2.
B. K and L. M, “UCI machine learning repository,” 2013.