Article

Multi-Scale Massive Points Fast Clustering Based on Hierarchical Density Spanning Tree

1 Chinese Academy of Surveying and Mapping, Beijing 100830, China
2 Key Laboratory of Monitoring, Evaluation and Early Warning of Territorial Spatial Planning Implementation, Ministry of Natural Resources, Chongqing 401147, China
3 School of Earth Sciences and Engineering, Xi’an Shiyou University, Xi’an 710065, China
4 School of Economic and Management, Shanghai University of Sport, 650 Qingyuanhuan Rd, Shanghai 200438, China
5 Faculty of Geomatics, Lanzhou Jiaotong University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2023, 12(1), 24; https://doi.org/10.3390/ijgi12010024
Submission received: 8 November 2022 / Revised: 18 December 2022 / Accepted: 12 January 2023 / Published: 14 January 2023
(This article belongs to the Special Issue GIS Software and Engineering for Big Data)

Abstract: Spatial clustering depends on spatial scale. With the widespread use of web maps, fast clustering of multi-scale spatial elements has become a new requirement. Therefore, to cluster and display elements rapidly at different spatial scales, we propose a method called Multi-Scale Massive Points Fast Clustering based on Hierarchical Density Spanning Tree. Drawing on the basic principle of the Clustering by Fast Search and Find of Density Peaks algorithm, this study introduces the concept of a hierarchical density spanning tree, combines the spatial scale with the tree links between elements to propose a corresponding pruning strategy, and thereby realizes fast multi-scale clustering of elements. The first experiment demonstrated the time efficiency with which the method obtains clustering results through distance-scale parameter adjustment, while also achieving accurate clustering results. The second experiment demonstrated the feasibility of the method for aggregating point elements and showed its visual effect. This provides a further explanation for the application of tree-link structures.

1. Introduction

Spatial clustering algorithms attempt to classify spatial elements into categories or clusters based on the similarity between their spatial locations and attributes. Clustering analysis has been widely used in urban planning, cartographic synthesis, seismic analysis, and image processing [1]. In the field of online visualization of massive data, spatial clustering has become an effective method for improving display speed and optimizing visual effects. Displaying massive amounts of data has been a challenge for visually rendering web maps. Clustering can effectively obtain class centroids with high density so that the location information of each class can be accurately displayed on the map, thereby reducing the pressure of data loading. Different spatial scales yield different clustering results [2]. Different clustering strategies are required at different scales to achieve consistent results with good visual perception.
With the rapid development of web maps, the multi-scale display of maps has prompted many studies on cartographic synthesis, including the fast aggregation of spatial point elements [3]. Some current studies establish multi-scale grids and aggregate massive points to the grid to complete clustering [4]; however, the accuracy of clustering is limited by the grid granularity [5]. The finer the grid, the higher the accuracy, but the slower the speed; to achieve high computing speed, clustering accuracy is often sacrificed. K-means [6], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [7], and other classical algorithms can achieve point element clustering with high precision; however, in the face of massive amounts of data, their processing time far exceeds the fast page response time required by users. Therefore, classical clustering algorithms do not satisfy the requirements of fast clustering at multiple scales.
Improving the speed and enhancing the accuracy of clustering are important requirements for the multi-scale clustering of web maps. Experiments were conducted with various classical clustering algorithms other than the grid-clustering method. It was found that the Clustering by Fast Search and Find of Density Peaks (CFSFDP) [8] algorithm can achieve fast clustering and high accuracy once the density values have been calculated. In the face of massive data, this method builds a tree structure between elements, characterized by connections from low-density to high-density elements, so the clusters among elements can be defined quickly and accurately [9]. Many scholars consider this method to have the characteristics of hierarchical clustering and have proposed their own interpretations and definitions based on this tree structure [10]. For example, Lv et al. [11] use ‘minimum spanning tree’ as the name for the structure relating the clustering center to the other elements, and Xu et al. [12] use the definition of a ‘leading tree’ to delineate the hierarchical density of spatial elements. This paper therefore uses ‘Hierarchical Density Spanning Tree’ to describe this kind of structure. As hierarchical clustering methods (such as single-link and complete-link) can generally handle multi-scale clustering, the CFSFDP algorithm, with its hierarchical structure, also has this ability. However, current multi-scale clustering methods have several issues. First, classical clustering algorithms cannot be adapted to multi-scale web map clustering scenarios that require rapid handling. Second, point clustering methods for web maps cannot accurately aggregate sets of elements with complex morphologies. In addition, CFSFDP based on a hierarchical tree structure has rarely been studied for multi-scale clustering.
Therefore, this paper proposes Multi-Scale Massive Points Fast Clustering based on Hierarchical Density Spanning Tree (MSCHT) to achieve fast multi-scale clustering for the online visualization of massive data. First, based on the CFSFDP algorithm, the role of density in the generation of tree structures is analyzed to better understand the mechanism of the method. Second, a reasonable tree-link structure of relations from low-density elements to high-density elements is extracted to describe the process of merging elements. Finally, while meeting the clustering requirements of complex morphological element sets, a hierarchical density spanning tree structure is built to achieve multi-scale real-time clustering and presentation of massive point data.
The remainder of this paper is organized as follows: Section 2 reviews classical clustering methods, hierarchical grid clustering methods, and web map clustering methods for multispatial scales. In addition, this paper reviews the hierarchical density-spanning tree structure from the CFSFDP algorithm. The proposed MSCHT algorithm is described in detail in Section 3. The experiments on real databases are presented in Section 4. Finally, conclusions are drawn, and future work is discussed in Section 5.

2. Related Works

2.1. Multi-Scale Clustering Methods

Scale effects have received extensive attention in clustering research, and scale is variously defined as data scale, analytical scale, visual scale, spatial scale, etc. [13]. Of these, spatial scale is the centerpiece of our work. Spatial scale commonly refers to the measure of spatial size used to conduct research; it is the indicator of change in spatial extent. The focus of the present study is spatial clustering at integrated cartographic and visual scales, with different clustering results obtained at different spatial scales. In other words, the aim is a reasonable clustering result at each spatial scale at which the human eye can effectively discern cluster separation. Clustering at multiple spatial scales can effectively extract the global distribution characteristics of geographical elements and further explore the clustering characteristics of local areas. In particular, this provides an efficient and convenient visualization tool for the current proliferation of massive spatial data.

2.1.1. Classical Clustering Methods Based on Modified Parameters

Existing classical clustering algorithms can be classified as partitional, hierarchical, density-based, graph-based, model-based, and grid-based [14]. Classical clustering based on modified parameters mainly refers to obtaining multi-scale clustering results by re-setting an algorithm’s own parameters to meet the requirements of different spatial-scale clustering scenarios. For such methods, each clustering run at a given spatial scale is independent of the others, and so are the clustering results [15]. For example, the classical K-means algorithm can discover clusters from the whole to the subtle by increasing or decreasing the value of k [6]; when k is smaller, each cluster covers a larger range, and vice versa [16]. As another example, DBSCAN can modify the search radius ε (eps) and the minimum number of points required to form a dense region (minpts) to fit different spatial scales [7]: a relatively small eps value is required when the spatial extent of the scene is small, and a relatively large eps value when it is large [17]. Improved versions of these algorithms also exist, such as Fuzzy C-Means (FCM) [18] and Ordering Points to Identify the Clustering Structure (OPTICS) [19], but most of them handle multi-scale problems by resetting parameters without the assistance of grids [13]. Although known clustering results can be aggregated or split across scales, the operating rules are complicated and cumbersome.
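The parameter-resetting approach can be sketched as follows. This is an illustration only: the use of scikit-learn and the synthetic two-blob data are assumptions of this sketch, not the setups used in the cited papers. The point is that each scale requires a full, independent re-run.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data (illustrative assumption): two well-separated blobs
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (100, 2)),
                 rng.normal(10, 0.5, (100, 2))])

# Coarse spatial scale: small k / large eps merge points into broad clusters
coarse_k = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
coarse_db = DBSCAN(eps=2.0, min_samples=5).fit(pts)

# Fine spatial scale: a larger k splits the same data into smaller clusters;
# every re-run starts from scratch -- nothing is reused between scales
fine_k = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pts)

n_coarse = len(set(coarse_k.labels_))
n_fine = len(set(fine_k.labels_))
n_db = len(set(coarse_db.labels_)) - (1 if -1 in coarse_db.labels_ else 0)
```

Each `fit` call repeats all distance computations, which is exactly the cost that multi-scale web map scenarios cannot afford.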

2.1.2. Multi-Scale Clustering Methods Based on Hierarchical Grid

Hierarchical grid-based clustering algorithms have their own characteristics in handling multiple scales; the use of grids with different resolutions is key to achieving multi-scale clustering results. A typical example is the Statistical Information Grid (STING) clustering method [5]. Its grid-based multi-resolution strategy divides the spatial region into rectangular cells. For different levels of resolution, there are usually multiple levels of rectangular cells that form a hierarchy: each grid cell at a higher level is divided into multiple grid cells at the lower level. Statistics for each grid cell’s properties (e.g., average, maximum, and minimum values) are precalculated and stored [20]. This method enables effective data aggregation, and gridding has become an important auxiliary method or preceding step in other improved multi-scale clustering methods. However, the quality of grid clustering depends on the granularity of the lowest level of the grid structure: if the grid is relatively fine, the processing cost increases significantly [21]; if the bottom layer of the grid structure is too coarse, the quality of the clustering analysis is reduced. This problem also appears in WaveCluster [22], Clustering in Quest (CLIQUE) [23], Entropy-based Clustering (ENCLUS) [24], and other methods.
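A minimal sketch of the hierarchical-grid idea follows, assuming a NumPy implementation and only a count statistic (STING also stores averages, maxima, and minima, which coarsen the same way). The extent and cell sizes are illustrative assumptions.

```python
import numpy as np

def grid_counts(points, cell_size, extent=(0.0, 0.0, 16.0, 16.0)):
    """Bottom layer of a STING-style hierarchy: points counted per cell."""
    x0, y0, x1, y1 = extent
    nx = int(round((x1 - x0) / cell_size))
    ny = int(round((y1 - y0) / cell_size))
    ix = np.clip(((points[:, 0] - x0) / cell_size).astype(int), 0, nx - 1)
    iy = np.clip(((points[:, 1] - y0) / cell_size).astype(int), 0, ny - 1)
    counts = np.zeros((ny, nx), dtype=int)
    np.add.at(counts, (iy, ix), 1)          # accumulate per-cell counts
    return counts

def coarsen(counts):
    """One level up: each parent cell sums a 2 x 2 block of child cells,
    so coarser layers reuse the precomputed statistics of the layer below."""
    ny, nx = counts.shape
    return counts.reshape(ny // 2, 2, nx // 2, 2).sum(axis=(1, 3))

rng = np.random.default_rng(1)
pts = rng.uniform(0, 16, (1000, 2))
fine = grid_counts(pts, cell_size=1.0)      # 16 x 16 bottom layer
parent = coarsen(fine)                      # 8 x 8 layer built from child stats
```

Note that the bottom-layer `cell_size` fixes the resolution of every layer above it, which is precisely the granularity trade-off described in the paragraph above.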

2.1.3. Fast Web Map Clustering Method

The point clustering algorithms of web maps pursue fast clustering adapted to front-end display, solving the difficulty of rendering large numbers of individual point elements on the map [25]. Commonly used algorithms are grid-based, distance-based, or combined grid-and-distance point clustering methods [26]. Classical clustering algorithms are not considered because of their operational efficiency. Currently, all major web maps provide point-clustering APIs that display the corresponding point-clustering results at different display scales, and the clustering methods used by the major vendors are basically similar. As there is no fixed name for this approach, this paper uses ‘web map clustering’ as a unified name. To accommodate front-end rendering, unlike the grid-clustering algorithms described above, web maps use a simpler method of point clustering with a grid: they only compute the sum of the elements within each grid cell at the current grid position, so the time complexity does not exceed O(n). The distance-based method merges elements separated by small distances. In the loading sequence of points, when a later point has earlier points within the search distance, it joins the class of the nearest earlier point; otherwise, it forms an independent class [27]. However, computing the distances between points takes a long time, so this method is seldom used in web maps. The point clustering algorithm based on a combination of grid and distance is currently the main web map clustering method. It is slightly slower than the grid-based method but needs only one calculation per original point, and each element is combined only with the elements that overlap its grid cell. The resulting clustering points reflect the location information of the original point elements more accurately. However, web map clustering is limited to point data: it has no suitable evaluation metric for aggregation accuracy and cannot perform multi-scale clustering of point sets with irregularly shaped distributions [28].
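The O(n) grid variant described above might be sketched as follows. The cell sizes and toy coordinates are illustrative assumptions, not any vendor’s actual API: each occupied cell becomes one aggregated marker carrying a centroid and a member count.

```python
from collections import defaultdict

def web_map_cluster(points, cell_size):
    """One-pass grid aggregation, O(n): each occupied cell is rendered as
    a single marker (centroid_x, centroid_y, member_count)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    markers = []
    for members in cells.values():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        markers.append((cx, cy, len(members)))
    return markers

pts = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1), (5.4, 5.3), (5.1, 5.5)]
zoomed_out = web_map_cluster(pts, cell_size=2.0)  # coarse tiles, few markers
zoomed_in = web_map_cluster(pts, cell_size=0.5)   # fine tiles, more markers
```

Because cell membership ignores the shape of the underlying distribution, irregularly shaped point sets are split arbitrarily at cell boundaries, which is the accuracy limitation noted above.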

2.2. CFSFDP and Hierarchical Density Spanning Tree Structure

The basis of the CFSFDP clustering algorithm is the assumption that cluster centers are surrounded by neighbors with a lower local density and that they are at a relatively large distance from any point with a higher local density. There are two important quantities to describe each data point: the local density and the distance from the nearest point with high density.
Many studies have improved local density calculation methods to achieve better clustering effects. For example, counting the number of points within a specified radius (Radius Density, RD) is the density calculation method used in the original CFSFDP article. As the same density values for nearby elements lead to poor calculation results for the radius method, Rodriguez and Laio [8] used the Gaussian function to improve the density calculation method. Xie et al. [29] used a fuzzy weighted K-Nearest Neighbor (KNN) method to improve the density calculation method and solve the inaccurate clustering of the unbalanced distribution of points. Mehmood et al. [30] used the constrained Kernel Density Estimation (KDE) method as the basis for local density calculations to improve clustering effects. Liu et al. [31] applied the Shared-Nearest Neighbor (SNN) to simple datasets as well as complex datasets of multiple scales, cross-winding, of various densities, or of high dimensions. In summary, all of the above methods can be used to calculate the density values for specific scenes.
In the CFSFDP method, each point has a distance value measured by computing the minimum distance between this point and any other point with a higher density; the distance value of the point with the largest local density is set to its maximum distance to any other point. The density and distance values of all points are assembled into a two-dimensional decision diagram with density as the horizontal coordinate and distance as the vertical coordinate. To identify the clustering centers, many scholars extract points with both higher density and higher distance than the other points [8,29,30,31]; the cluster centroids are prominent and can be delineated in the decision diagram, so the clustering centers are easy to extract. Many scholars have focused on choosing the correct number of cluster centers. Xu et al. [12] obtained clustering centers with a linear fitting method that segments the data at intervals with large gradient changes. Zhou et al. [32] represented a cluster by a set of core objects instead of a single object and classified the remaining objects based on their distribution information. Li et al. [33] obtained an appropriate number of clustering centers by selecting points with high density and distance values and merging those close to each other, giving an automatic clustering algorithm for determining the clustering centers.
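The two decision-diagram quantities can be computed directly. This sketch assumes brute-force pairwise distances and the radius-count density, with density ties broken by storage order so that the densest point is unique; the toy data are hypothetical.

```python
import numpy as np

def decision_values(points, eps):
    """CFSFDP decision-diagram quantities: local density rho (radius count)
    and delta, the distance to the nearest point of higher density."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rho = (d < eps).sum(axis=1) - 1           # exclude the point itself
    n = len(points)
    order = np.lexsort((np.arange(n), -rho))  # densest first, ties by index
    delta = np.empty(n)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = d[i].max()             # global density peak
        else:
            delta[i] = d[i, order[:rank]].min()
    return rho, delta

# Hypothetical toy data: two tight triples far apart
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
rho, delta = decision_values(pts, eps=0.5)
```

In the resulting decision diagram, one point per group stands out with a large delta, which is why cluster centers are easy to read off.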
A hierarchical tree structure is an important intermediate result of the CFSFDP density algorithm. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. Rodriguez and Laio [8] suggested that cluster assignment should be performed in a single step. Therefore, many scholars have studied the problem of assigning all remaining points, except the clustering centers, to obtain clustering results.
The hierarchical density spanning tree structure is an additional product of CFSFDP. It is a network of relationships between elements obtained by retaining the pointing vector when calculating the distance at each point. In other words, each point has a unique nearest point with a density greater than its own. The point dataset must have a unique maximum-density point (points with the same density value can be distinguished by their storage order), so all points converge to the highest-density point. As the connection structure between elements does not cross, similar to a tree, CFSFDP has a hierarchical character.

3. Methodology

In this section, we propose the MSCHT algorithm for the multi-scale clustering of massive points. First, the hierarchical distribution of points under different density functions is analyzed. Second, a density hierarchy division method is introduced, and tree-structure relationships are generated from each point on this basis. Finally, clustering results are obtained through tree pruning. The flow of the algorithm is shown in Figure 1, and its details are given in Algorithm 1.
Algorithm 1 MSCHT.
Input: dataset X, density calculation method Den, display region BOX(lon, lat, H, W), cutting rate a
Output: result Clu = {C1, C2, …, Cm} (where m is the number of clusters)
1. Initialize dataset X1;
2. Build KD Tree(X1);
3. Dn×n = {dij}n×n; // Build the distance matrix Dn×n
4. ρ = Den(Dn×n); // Density calculation method Den, e.g., RD, KDE, KNN, SNN
5. X2 = DropSortByDensity(X1, ρ);
6. T1 = {}; // Build an empty hierarchical density spanning tree
7. for i = 2; i <= n; i = i + 1 do
8.   Yi = {j, ρj > ρi};
9.   δi = min j∈Yi (dij);
10.   Superpointi = j (dij = δi); // Build the i–j link
11.   T1.add((i, j): dij);
12. end
13. T2 = DropSortByDistance(T1, dij); // Hierarchical density spanning tree
14. T3 = Intersect(T2, BOX); // Get the links within the display region BOX
15. dc = max(H, W) × a;
16. m = 0;
17. for each link in T3 do
18.   if dij > dc then
19.     m = m + 1;
20.     i ∈ Cm;
21.     Clu.add(Cm);
22.   else
23.     Ct.contains(j) (t <= m);
24.     i ∈ Ct;
25.   end
26. end
27. if m == 0 then
28.   C1 = {i, i ∈ T3};
29.   Clu = {C1};
30. end
31. return Clu;
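A compact Python sketch of Algorithm 1 follows, under simplifying assumptions: a brute-force distance matrix instead of a KD tree, RD as the density method, and the whole extent treated as visible (so the Intersect step is omitted). The function name and toy data are ours, not the authors’ code.

```python
import numpy as np

def mscht(points, eps, H, W, a):
    """Sketch of Algorithm 1: RD density, hierarchical density spanning
    tree, then pruning of links longer than dc = max(H, W) * a."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rho = (d < eps).sum(axis=1) - 1               # step 4: RD density
    order = np.lexsort((np.arange(n), -rho))      # step 5: densest first
    parent = np.full(n, -1)
    link_len = np.zeros(n)
    for rank in range(1, n):                      # steps 7-12: build the tree
        i = order[rank]
        denser = order[:rank]                     # Yi: all denser points
        j = denser[np.argmin(d[i, denser])]
        parent[i] = j                             # i -> j link (superpoint)
        link_len[i] = d[i, j]                     # delta_i
    dc = max(H, W) * a                            # step 15: distance threshold
    labels = np.full(n, -1)
    m = 0
    for i in order:                               # steps 17-26: prune links
        if parent[i] == -1 or link_len[i] > dc:
            labels[i] = m                         # cut link: start a new class
            m += 1
        else:
            labels[i] = labels[parent[i]]         # keep link: inherit class
    return labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = mscht(pts, eps=0.5, H=10.0, W=10.0, a=0.15)  # dc = 1.5
```

Because points are processed in descending density order, every parent is labeled before its children, so a single pass over the tree suffices.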

3.1. Density Calculation

In a cluster, we assume that points in the central area are more important than points in the edge area because the central points belong to the cluster more unambiguously. Therefore, points belonging to the same cluster differ in value owing to their different positions, and the point with the highest value is regarded as the cluster center. Because cluster centers are commonly surrounded by points of the same cluster, they tend to exhibit core polarization under density (KDE), nearest-neighbor (KNN, SNN), and other importance metrics, and to attain the maximum density value within the cluster. A reasonable density calculation can assist the convergence of point elements toward the cluster center. After the importance of the elements is quantified, the high-value elements in the center of a class are clearly far fewer than those at the periphery, and there are obvious density differences between elements. This density difference is one of the key foundations of CFSFDP. The density (Den) at each point is calculated as follows:
Den ∈ {RD, KDE, KNN, SNN, …}(1)
As an example, RD is given by Equations (2) and (3):
ρi = Σj≤n χ(dij)(2)
χ(x) = 1, if x ≤ eps; 0, if x > eps(3)
RD is calculated using Equations (2) and (3): χ(x) equals 1 when another point lies within radius eps and 0 otherwise. χ(x) is defined differently for each Den. For example, KDE weights the distance dij with a Gaussian function, and KNN computes the density from the distances to the nearest k points (‘k’ denotes the number of nearest neighbors, which is different from the K-means ‘k’).
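Two interchangeable Den functions might look as follows, operating on a precomputed distance matrix; the three-point toy matrix is an assumption for illustration. The Gaussian variant shows why nearby points rarely tie in density, unlike the hard radius count.

```python
import numpy as np

def rd_density(d, eps):
    """Radius density, Eqs. (2)-(3): count of other points within eps."""
    return (d < eps).sum(axis=1) - 1

def kde_density(d, eps):
    """Gaussian-kernel variant: distances are soft-weighted rather than
    hard-thresholded, so density ties between nearby points are rare."""
    return np.exp(-(d / eps) ** 2).sum(axis=1) - 1.0

# Hypothetical distance matrix: points 0 and 1 are close, point 2 is far
d = np.array([[0.0, 0.2, 3.0],
              [0.2, 0.0, 3.0],
              [3.0, 3.0, 0.0]])
rho_rd = rd_density(d, eps=1.0)
rho_kde = kde_density(d, eps=1.0)
```

Either function can be plugged in as Den without changing the rest of the algorithm, which is the point of Equation (1).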
Figure 2 shows the scatter plot of the five clusters. Color and size were mainly expressed based on the density value of each point. Generally, the number of points with low-density values is larger than that with high-density values. The density hierarchy tree can be reasonably generated only when the intraclass clustering core is accurately identified as having a high-density value.

3.2. Hierarchical Density Spanning Tree

Using Equation (4), we calculate for each element i the shortest distance to a point j with higher density as a measure:
δi = min j: ρj > ρi (dij), if ρi ≠ max(ρ); δi = max j (dij), if ρi = max(ρ)(4)
δi is the minimum distance between point i and any other point with a higher density. For the point with the largest density max(ρ), δi is assigned the maximum distance dij over all points j.
A pointing connection is formed when each point i retains point j as its superior point. When all points possess a unique pointing connection, as shown in Figure 3, an obvious tree-like structure emerges (this structure is a better measurement of the relationships between elements). For example, the density of point A is greater than that of point B, and among all points denser than B, point A is the closest; thus, B belongs to A and is a child node of A. Similarly, C and D are child nodes of A. Iterating, E, F, G and other points belong to B, and each point has a unique higher-density point at the upper level, forming a density-based N-ary tree structure. The distance between A and B is greater than the distance between B and E. At the same time, this study found that, in general, the distance between potential cluster center points is much larger than the distances between other points. Figure 3b shows a hierarchical density spanning tree based on RD density.
By using the tree structure, we can not only get the connection affinity between elements but also obtain the inter-class relationship of the clustering results.

3.3. Cutting Strategy

The cutting strategy is a key step in achieving multi-scale clustering. Combining the “map display extent” with the “hierarchical density spanning tree” is currently the main issue. Based on the established tree-like element relationships, the expected clustering effect can be obtained by pruning the branches based on certain mathematical rules. Therefore, reasonable pruning is crucial to the clustering results. By drawing on the pruning schemes of other clustering algorithms with a tree structure, the following clustering pruning schemes can be summarized.
We take the Minimum Spanning Tree (MST) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) as references, which also rely on the distances between points to decide the separation between classes. As MSCHT likewise has a tree-link structure, links are cut when their distance exceeds a specific threshold. Because the pruning strategies of MST and HDBSCAN are too cumbersome for multi-scale use, a simpler approach is adopted in this study. The distance threshold dc is calculated as follows:
dc = max(H, W) × a(5)
where H is the map height and W is the map width; both vary with the extent of the map display at different spatial scales. a is a parameter fixed across all spatial scales and is determined according to the clustering visual effect, as specified in the following example.
According to Figure 3, long tree links are few in number and short links are numerous. The frequency of each tree-link length in the data of Figure 3 was counted and is shown in Figure 4. With a fixed display range, the value of a determines the clustering effect. Setting a in the range 0–50%, we found that clustering was best with a between 10% and 20%, as shown in Figure 4b–e. This phenomenon is also reflected at other display scales. Therefore, to satisfy the visual effect at every display scale, a value of 10–20% is recommended for a. A well-chosen a value makes the differences between categories easy for the user to recognize.
When a is fixed, dc changes as the display range box (H, W) is scaled, which affects the clustering results. When a point’s link length is greater than dc within the map extent, this single element forms its own class.
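Because cutting k links of a tree leaves k + 1 connected components, re-clustering at a new display extent reduces to recomputing dc from Equation (5) and recounting the links longer than it. A sketch with hypothetical link lengths:

```python
def cluster_count(link_lengths, H, W, a):
    """Clusters after pruning a spanning tree: each cut link adds one class,
    so k cut links yield k + 1 clusters. Only dc changes between scales."""
    dc = max(H, W) * a
    return 1 + sum(1 for length in link_lengths if length > dc)

# Hypothetical tree links: two tight groups joined by one long link
lengths = [0.1, 0.1, 7.0, 0.1, 0.1]
small_scale = cluster_count(lengths, H=100.0, W=100.0, a=0.16)  # dc = 16
large_scale = cluster_count(lengths, H=10.0, W=10.0, a=0.16)    # dc = 1.6
```

Zoomed out (dc = 16), even the long link survives and everything merges into one class; zoomed in (dc = 1.6), the long link is cut and the two groups separate, with no density recomputation.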

4. Experiments and Assessments

4.1. Datasets

4.1.1. Synthetic Datasets

Synthetic datasets from the relevant CFSFDP density improvement method [8,28,29,30] and other clustering literature [34,35,36,37,38] were used to verify the clustering effect of the proposed algorithm. The details are shown in Table 1.

4.1.2. Real Datasets

Two datasets were used in this experiment: the first dataset included building points in Beijing and the second dataset included landslide hazard sites in Pingwu County, Sichuan. The former dataset has a large data volume and range. It was used to test the performance of MSCHT in terms of clustering speed and to verify the identification of local high-density areas. The latter has a smaller data volume and range, and is used to demonstrate the specific application of the hierarchical density spanning tree structure to obtain clustering results at different display scales. The specific dataset description is detailed in Table 2 and the data distribution is detailed in Figure 5.

4.2. Experiments on Synthetic Data

To initially test the clustering effect of MSCHT, synthetic data were used for validation at a single scale. In particular, by adopting density strategies suited to the spatial distribution characteristics of the different datasets, good clustering results can be obtained, as shown in Figure 6. In summary, by using different density strategies, the method in this paper adapts to synthetic data scenarios such as heterogeneous density or weak connectivity, which makes it more extendable and adaptable to more data distribution scenarios.

4.3. Speed Comparison with Classical Clustering Algorithms

To compare the speed and accuracy differences between the classical clustering algorithm and web map clustering algorithm, K-means and DBSCAN algorithms were selected as representatives of the classical algorithm. Point clustering methods for the Leaflet, Google map, and Amap were selected as representatives of web map clustering algorithms. In terms of the evaluation of accuracy, as the original building data have no category attribute, classification based on administrative divisions cannot effectively distinguish category attribution. Therefore, 36 known classes were obtained by box selection sampling only in areas with high building density, such as the central districts of Beijing, Fengtai region, and Beiyuan region, as manual cluster samples for accuracy evaluation.
Based on preliminary observations of the building point data in Beijing, approximately 1000 building points lie within a 1000 m search radius in the smallest high-density building areas, and the distance between dense point areas is approximately 5000 m. Beijing has 16 municipal districts, and it is assumed that each district has about two clusters. The initial clustering parameters for each method are listed in Table 3. As the distribution of the building points in Beijing is similar to Figure 5a or Figure 5b, RD is used as the density calculation method. For DBSCAN and MSCHT to accurately identify high-density areas, eps is assigned the radius of the maximum inscribed circle of the smallest high-density sample area; both use a search radius of eps = 1000 m to calculate density. MSCHT uses dc = 5000 m as the distance threshold to truncate inter-class relationships. DBSCAN assumes that there are more than approximately 500 points within the core point radius. K-means assumes that Beijing has 30 clusters. As level 8 shows the full view of Beijing, the web map clustering methods all use tile level 8 as the initial state. During the initial clustering process, the MSCHT algorithm required the longest execution time (47 m 34 s), compared with 1 m 3 s for K-means and 40 m 35 s for DBSCAN. K-means was faster because no density calculation is needed. As the MSCHT algorithm additionally requires the calculation of tree connections, its total calculation time exceeds that of DBSCAN. Owing to the construction of a spatial index of points, however, the overall complexity of the algorithm is reduced to O(n log n). Leaflet, Google Maps, and Amap all use gridded data statistics, so their times are 1 m 2 s, 49 s, and 40 s, respectively (excluding front-end data loading and rendering time).
Thus, in the initial clustering, the MSCHT algorithm required more time than the other methods.
However, clustering results generally differ across scales. To observe the results from other spatial perspectives, the parameters of the previous clustering must be adjusted as the display scale changes. DBSCAN and K-means cannot easily inherit the results of the previous clustering at the first scale change; because their intermediate results are difficult to reuse, the whole DBSCAN or K-means process must be recalculated after every parameter adjustment. In contrast, the MSCHT algorithm completes the density calculation and the tree-structure connection during the initial clustering, so no density calculation is needed in subsequent tuning, which accelerates the processing. Obtaining subsequent clustering results requires only a traversal and pruning of the tree-branch connections with an adjusted dc value, with a complexity of only O(n). Similarly, the web map is based on a multi-scale grid of tree connections and obtains the corresponding results according to the zoom attribute. Therefore, the MSCHT algorithm (0.023 s) was much faster than DBSCAN (45 m 34 s) and K-means (1 m 22 s), and comparable to the web clustering methods in the speed of obtaining clustering cores; it is slightly faster because the pruning operation of MSCHT is performed in the back end. In particular, the larger the value of dc, the fewer prunings are required and the smaller the number of classes, so the operation is faster; conversely, it is slower.
To better verify the clustering effect of each method, we evaluated clustering accuracy on high-density samples at three scales: large, medium, and small. After adjusting the parameters and scales several times, the best clustering result was obtained for each method; among them, MSCHT performed best on the test sample based on the known higher-density town areas. Table 4 lists the clustering accuracy of each algorithm at different scales. The evaluation indicators, i.e., the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Adjusted Mutual Information (AMI), of MSCHT were higher than those of the other methods at every scale. Because the results of the web-based clustering methods are similar, the Leaflet clustering results are used as the main comparison.
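The three indicators are standard external clustering measures; a minimal sketch with scikit-learn on toy labels (illustrative only, not the paper's data) looks like this:

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

# Toy ground-truth vs. predicted labels: one point of the middle
# cluster is mislabelled, so all three scores fall below 1.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]

print("ARI:", adjusted_rand_score(truth, pred))
print("NMI:", normalized_mutual_info_score(truth, pred))
print("AMI:", adjusted_mutual_info_score(truth, pred))
```

All three scores reach 1.0 only for a perfect match, and ARI/AMI are chance-adjusted, which is why they are commonly reported together.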
In summary, with the aid of the hierarchical density spanning tree, the MSCHT algorithm has a computational speed advantage in multi-scale clustering and effectively identifies clusters in high-density aggregation areas of the major points. Its main drawback is that constructing the hierarchical density spanning tree is time-consuming; to achieve a responsive front-end web-map clustering effect, it is advisable to compute the tree structure of the point elements in the back end.

4.4. Specific Application Effects of Multi-Scale Clustering

4.4.1. Multi-Scale Clustering Results for Beijing Building Points

Figure 7 shows the distribution of multi-scale clustering results for all building point data in Beijing. Zooming in step by step on the southeast corner of Beijing yields the corresponding clustering results at each level. Because it is difficult to convey the clustering distribution pattern of all points through aggregated symbols alone, the cluster attribution of every point is additionally rendered. The cutting distance dc at each scale equals 16% of the current map width, and the results are obtained level by level.
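The scale-to-dc rule can be sketched numerically. The paper states only the 16% fraction; the ~10 cm view width below is our assumption, chosen because it makes the numbers consistent with the levels reported next (at 1:1,250,000 such a view spans 125 km of ground, 16% of which is the 20 km cutoff of the first level, and 1:500,000 gives 8 km):

```python
def cutoff_distance(scale_denominator, screen_width_m=0.1, fraction=0.16):
    """dc as a fraction of the ground width covered by the map view.

    Assumes a map view about 10 cm wide on screen; the ground width is
    then the scale denominator times that physical width.
    """
    ground_width_m = scale_denominator * screen_width_m
    return fraction * ground_width_m

for denom in (1_250_000, 500_000, 250_000, 125_000):
    print(f"1:{denom:,} -> dc = {cutoff_distance(denom) / 1000:g} km")
```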
Figure 7a shows the first map level at a scale of 1:1,250,000. Tree links longer than 20 km were cut, yielding 23 clusters. The central region of Beijing has the most clustered points, while building points elsewhere converge toward local areas of high density, achieving large-scale point aggregation. Figure 7a also shows that the boundaries between classes are irregular curves, indicating that the method can identify irregularly shaped point clusters.
Figure 7b shows the second map level at a scale of 1:500,000. Tree links longer than 8 km were cut, yielding 116 clusters; compared with Figure 7a, these refine the first clustering result. Figure 7c,d corresponds to the clustering results at scales of 1:250,000 and 1:125,000, respectively, which continue to split the clusters of the previous level, producing the multi-scale aggregation effect.
In terms of visual effect, MSCHT effectively aggregates multi-scale point elements. Unlike web-map clustering algorithms, it imposes no partitions or constraints along grid boundaries. Therefore, for irregularly distributed building point sets, such as towns and streets, the building areas constrained by the density spanning tree are better identified, which is why the ARI/NMI/AMI of MSCHT were higher than those of web-map clustering.

4.4.2. Results of Hierarchical Density Spanning Tree

Owing to the large volume of the Beijing data, the hierarchical density tree structure cannot be displayed clearly there. The second experiment therefore used the smaller set of landslide hazard sites in Pingwu County, Sichuan, to illustrate the structure at multiple scales. The rules underlying the MSCHT clustering method can be shown more intuitively by adding the tree-link connections and heat-map illustrations. In this case, the cutoff distance was set to 10% of the denominator of the map scale.
Figure 8 shows the multi-scale clustering results of the landslide hazard points in Pingwu County. The black box marks the first-level extent, which contains eight clusters, each point converging toward a local density maximum. The pink cluster in Figure 8a is re-clustered from its internal points in Figure 8b: the points in the upper half remain a single cluster because their link distances have not yet reached the cutoff, while the points in the lower half are differentiated. The upper-half points are therefore marked by the blue box, and their tree structure is further partitioned into more microscopic clusters. Each cluster is covered by a region of high local density, which is an ideal state of aggregation.
In particular, when large-scale disasters such as landslides and earthquakes occur, MSCHT can quickly perform multi-scale clustering over the distribution of disaster sites to help deploy multi-level disaster-relief centers. For example, when the number of elements within a class exceeds a threshold, small relief sites can be established within those clusters; as the scale increases, large sites can be set up for the larger clusters, compensating for the limited relief capacity of the small sites. At the same time, identifying high-density areas enables more reasonable allocation of relief.
Additionally, the effect of the hierarchical density spanning tree can be understood more intuitively from the heat map. Figure 8 shows that tree links within a class are shorter than those between classes, so the results obtained by cutting the long tree links agree well with the spatial pattern of the heat map.

5. Conclusions and Future Work

This paper proposes an improved multi-scale clustering method, driven by the display range of the map, that draws on the CFSFDP algorithm to build a hierarchical density-spanning tree structure. It obtains the corresponding clusters by cutting excessively long tree connections at different display levels and retaining the short tree links. Experiments show that: (1) the pre-built hierarchical density-spanning tree reduces the time complexity of obtaining clustering results to O(n), which speeds up the clustering operation and meets the requirement of web maps for real-time multi-scale clustering results; (2) the method is not constrained by a grid and can effectively identify irregular regions of geographic space where entities are denser, with higher aggregation accuracy; (3) the hierarchical density-spanning tree provides the basis for the aggregation of each element and reflects the differences between classes.
In the future, the impact of density calculation on tree-structure generation should be examined in depth to find the density measure that best distinguishes elements. Tree-connection patterns other than linear distance should also be investigated, introducing graph theory to construct more robust tree structures. Meanwhile, further research on GPU and distributed density computation is needed to address MSCHT's long initial clustering time [39]. Finally, the visual rendering of this clustering on the web-map side should be explored to expand its applications, continuing the study of complex multi-scale spatial data processing and providing basic tools for such research.

Author Contributions

Song Chen, Fuhao Zhang, Xizhi Zhao: Conceptualization, Methodology, Software. Song Chen, Shangqin Liu: Data curation, Writing—Original draft preparation. Song Chen: Visualization, Investigation. Zhiran Zhang, Siyi Yu, Agen Qiu: Supervision. Song Chen, Fuhao Zhang: Software, Validation. Zhiran Zhang, Siyi Yu, Xizhi Zhao: Writing—Reviewing and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the National Key R&D Program of China [grant numbers 2019YFB2102503, 2019YFB2102500]; the Open Fund of Key Laboratory of Monitoring, Evaluation and Early Warning of Territorial Spatial Planning Implementation, Ministry of Natural Resources [grant number LMEE-KF2021002]; National Natural Science Foundation of China [grant number 42201331); Chinese Academy of Surveying and Mapping Basic Research Fund Program [grant number AR2111].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nguyen, T.T.; Krishnakumari, P.; Calvert, S.C.; Vu, H.L.; van Lint, H. Feature extraction and clustering analysis of highway congestion. Transp. Res. Part C Emerg. Technol. 2019, 100, 238–258. [Google Scholar] [CrossRef]
  2. Liu, Q.; Li, Z.; Deng, M.; Tang, J.; Mei, X. Modeling the effect of scale on clustering of spatial points. Comput. Environ. Urban Syst. 2015, 52, 81–92. [Google Scholar] [CrossRef]
  3. Fürhoff, L. Rethinking the Usage and Experience of Clustering in Web Mapping. In International Conference on Human-Computer Interaction; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar] [CrossRef]
  4. Wu, B.; Wilamowski, B.M. A Fast Density and Grid Based Clustering Method for Data With Arbitrary Shapes and Noise. IEEE Trans. Ind. Informatics 2016, 13, 1620–1628. [Google Scholar] [CrossRef]
  5. Wang, W.; Yang, J.; Muntz, R. STING: A statistical information grid approach to spatial data mining. Vldb 1997, 97, 186–195. [Google Scholar]
  6. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. 1979, 28, 100–108. [Google Scholar] [CrossRef]
  7. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. kdd 1996, 96, 226–231. [Google Scholar]
  8. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [Green Version]
  9. Guan, J.; Li, S.; He, X.; Zhu, J.; Chen, J. Fast hierarchical clustering of local density peaks via an association degree transfer method. Neurocomputing 2021, 455, 401–418. [Google Scholar] [CrossRef]
  10. Long, Z.; Gao, Y.; Meng, H.; Yao, Y.; Li, T. Clustering based on local density peaks and graph cut. Inf. Sci. 2022, 600, 263–286. [Google Scholar] [CrossRef]
  11. Lv, X.; Ma, Y.; He, X.; Huang, H.; Yang, J. CciMST: A Clustering Algorithm Based on Minimum Spanning Tree and Cluster Centers. Math. Probl. Eng. 2018, 2018, 8451796. [Google Scholar] [CrossRef] [Green Version]
  12. Gui, Z.; Peng, D.; Wu, H.; Long, X. MSGC: Multi-scale grid clustering by fusing analytical granularity and visual cognition for detecting hierarchical spatial patterns. Futur. Gener. Comput. Syst. 2020, 112, 1038–1056. [Google Scholar] [CrossRef]
  13. Madhulatha, T.S. An overview on clustering methods. IOSR J. Eng. 2012, 2, 719–725. [Google Scholar] [CrossRef]
  14. Carr, J.K.; Fontanella, S.A.; Tribby, C.P. Identifying American Beer Geographies: A Multiscale Core-Cluster Analysis of U.S. Breweries. Prof. Geogr. 2018, 71, 185–196. [Google Scholar] [CrossRef]
  15. Das, A.; Waslander, S.L. Scan registration with multi-scale k-means normal distributions transform. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012. [Google Scholar]
  16. Chen, R.; Zhao, S.; Liang, M. A Fast Multiscale Clustering Approach Based on DBSCAN. Wirel. Commun. Mob. Comput. 2021, 2021, 1–11. [Google Scholar] [CrossRef]
  17. Nayak, J.; Naik, B.; Behera, H.S. Fuzzy C-Means (FCM) Clustering Algorithm: A Decade Review from 2000 to 2014. In Computational Intelligence in Data Mining; Springer: Delhi, India, 2015; Volume 2, pp. 133–149. [Google Scholar]
  18. Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod Rec. 1999, 28, 49–60. [Google Scholar] [CrossRef]
  19. Rani, P. A survey on STING and CLIQUE grid based clustering methods. Int. J. Adv. Res. Comput. Sci. 2017, 8, 1510–1512. [Google Scholar]
  20. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. WIREs Data Min. Knowl. Discov. 2011, 2, 86–97. [Google Scholar] [CrossRef]
  21. Sheikholeslami, G.; Chatterjee, S.; Zhang, A. WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB J. 2000, 8, 289–304. [Google Scholar] [CrossRef]
  22. Santhisree, K.; Damodaram, A. CLIQUE: Clustering based on density on web usage data: Experiments and test results. In Proceedings of the 2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, 8–10 April 2011. [Google Scholar]
  23. Cheng, C.-H.; Fu, A.W.; Zhang, Y. Entropy-based subspace clustering for mining numerical data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 15–18 August 1999. [Google Scholar]
  24. Cheng, L.; Connor, T.; Sirén, J.; Aanensen, D.M.; Corander, J. Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software. Mol. Biol. Evol. 2013, 30, 1224–1228. [Google Scholar] [CrossRef] [PubMed]
  25. Habibullayevich, G.K.; Chen, X.; Shin, H. Efficient Filtering and Clustering Mechanism for Google Maps. J. Adv. Manag. Sci. 2013, 1, 107–111. [Google Scholar] [CrossRef] [Green Version]
  26. Beresnev, A.; Semenov, A.; Panidi, E. Hexagonal grids applied to clustering locations in web maps. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 43, 435–440. [Google Scholar] [CrossRef]
  27. Netek, R.; Brus, J.; Tomecka, O. Performance Testing on Marker Clustering and Heatmap Visualization Techniques: A Comparative Study on JavaScript Mapping Libraries. ISPRS Int. J. Geo-Inf. 2019, 8, 348. [Google Scholar] [CrossRef]
  28. Xie, J.; Gao, H.; Xie, W.; Liu, X.; Grant, P.W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 2016, 354, 19–40. [Google Scholar] [CrossRef]
  29. Mehmood, R.; Zhang, G.; Bie, R.; Dawood, H.; Ahmad, H. Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing 2016, 208, 210–217. [Google Scholar] [CrossRef]
  30. Liu, R.; Wang, H.; Yu, X. Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 2018, 450, 200–226. [Google Scholar] [CrossRef]
  31. Xu, J.; Wang, G.; Deng, W. DenPEHC: Density peak based efficient hierarchical clustering. Inf. Sci. 2016, 373, 200–218. [Google Scholar] [CrossRef]
  32. Zhou, Z.; Si, G.; Sun, H.; Qu, K.; Hou, W. A robust clustering algorithm based on the identification of core points and KNN kernel density estimation. Expert Syst. Appl. 2022, 195, 116573. [Google Scholar] [CrossRef]
  33. Tao, L.; Hongwei, G.; Shuzhi, S. Density peaks clustering by automatic determination of cluster centers. J. Front. Comput. Sci. Technol. 2016, 10, 1614–1622. [Google Scholar]
  34. Peng, D.; Gui, Z.; Wang, D.; Ma, Y.; Huang, Z.; Zhou, Y.; Wu, H. Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nat. Commun. 2022, 13, 5455. [Google Scholar] [CrossRef]
  35. Veenman, C.J.; Reinders MJ, T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1273–1280. [Google Scholar] [CrossRef] [Green Version]
  36. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. Acm Trans. Knowl. Discov. Data 2007, 1, 4-es. [Google Scholar] [CrossRef] [Green Version]
  37. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 2007, 8, 3. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  38. Chang, H.; Yeung, D.-Y. Robust path-based spectral clustering. Pattern Recognit. 2008, 41, 191–203. [Google Scholar] [CrossRef]
  39. Zhang, G.; Zhu, A.X.; Huang, Q. A GPU-accelerated adaptive kernel density estimation approach for efficient point pattern analysis on spatial big data. Int. J. Geogr. Inf. Sci. 2017, 31, 2068–2097. [Google Scholar] [CrossRef]
Figure 1. MSCHT algorithm flow chart.
Figure 2. Density distribution evaluation.
Figure 3. Schematic diagram of tree structure. (a) Tree structure diagram; (b) tree feature distribution map.
Figure 4. Tree-link length frequency distribution and result of cutting. (a) Frequency distribution of tree-link length (class interval 0.4); the red dashed line marks the value a used for dc. (b–e) Clustering results obtained for different a values.
Figure 5. Experimental data distribution. (a) Distribution of building points in Beijing. (b) Distribution of landslide hazard points in Pingwu County, Sichuan.
Figure 6. Clustering results of synthetic data under different density strategies. (a) D31 dataset, Den = KDE. (b) Aggregation dataset, Den = KNN. (c) Flame dataset, Den = KDE. (d) R15 dataset, Den = RD. (e) Path-based1: spiral dataset, Den = KNN. (f) Path-based2 dataset, Den = SNN.
Figure 7. Multi-scale clustering results of Beijing building points; (a–d) clustering results at different scales.
Figure 8. Multi-scale clustering results of landslide hazard points in Pingwu County; (a–c) clustering results and tree structure at different scales.
Table 1. Synthetic datasets.

| Name | Source | Count | Classes |
|---|---|---|---|
| D31 | [35] | 3100 | 31 |
| Aggregation | [36] | 788 | 7 |
| flame | [37] | 240 | 2 |
| R15 | [35] | 600 | 15 |
| Path-based1 spiral | [38] | 312 | 3 |
| Path-based2 | [38] | 300 | 3 |
Table 2. Detailed description of datasets.

| Dataset | Objects (n) | Boundary |
|---|---|---|
| Building points in Beijing | 1,941,608 | 115.416827, 39.442087 : 117.508251, 41.058964 |
| Landslide hazard points in Pingwu County, Sichuan | 2927 | 104.514362, 32.396887 : 104.560734, 32.415562 |
Table 3. Clustering speed comparison table.

| Method | Initial Time (s) | Initial Parameters | Complexity | 1st Transfer Time (s) | 1st Transfer Parameters | Complexity | 2nd Transfer Time (s) | 2nd Transfer Parameters | Complexity |
|---|---|---|---|---|---|---|---|---|---|
| MSCHT | 2854 | eps = 1000; dc = 5000 | O(nlogn) | 0.234 | dc = 8000 | O(n) | 0.193 | dc = 10,000 | O(n) |
| K-means | 63 | k = 30 | O(kn) | 82 | k = 50 | O(kn) | 142 | k = 60 | O(kn) |
| DBSCAN | 2435 | eps = 1000; minpts = 500 | O(nlogn) | 2734 | eps = 1000; minpts = 800 | O(nlogn) | 2592 | eps = 1000; minpts = 750 | O(nlogn) |
| Leaflet-webmap | 62 | Zoom = 8 | O(n) | 0.582 | Zoom = 9 | O(n) | 0.344 | Zoom = 10 | O(n) |
| Google-webmap | 49 | Zoom = 8 | O(n) | 2.19 | Zoom = 9 | O(n) | 2.03 | Zoom = 10 | O(n) |
| Amap-webmap | 40 | Zoom = 8 | O(n) | 2.18 | Zoom = 9 | O(n) | 1.88 | Zoom = 10 | O(n) |
Table 4. Comparison of multi-scale optimal clustering accuracy.

| Scale | Indicator | MSCHT | K-means | DBSCAN | Leaflet |
|---|---|---|---|---|---|
| Large scale (Municipal) | Parameters | eps = 1000; dc = 9000 | k = 60 | eps = 1000; minpts = 750 | Zoom = 8 |
| | ARI | 0.997437 | 0.989369 | 0.901756 | 0.878996 |
| | NMI | 0.989205 | 0.968583 | 0.958969 | 0.916621 |
| | AMI | 0.989197 | 0.968558 | 0.958931 | 0.916556 |
| Medium scale (District) | Parameters | dc = 4000 | k = 360 | eps = 750; minpts = 500 | Zoom = 11 |
| | ARI | 0.991010 | 0.868969 | 0.925346 | 0.834260 |
| | NMI | 0.981287 | 0.887952 | 0.920372 | 0.871474 |
| | AMI | 0.981250 | 0.887793 | 0.920238 | 0.870970 |
| Small scale (Street) | Parameters | dc = 2000 | k = 720 | eps = 500; minpts = 100 | Zoom = 13 |
| | ARI | 0.914856 | 0.880267 | 0.885048 | 0.792122 |
| | NMI | 0.856715 | 0.798256 | 0.863005 | 0.742880 |
| | AMI | 0.856438 | 0.797970 | 0.862923 | 0.741472 |