1. Introduction
With the rapid development of remote sensing technology and free data access, more and more land-cover maps have been produced via image classification analysis in recent decades [
1]. Such products are important for environmental monitoring applications, such as studies of water-related ecosystems, urban land expansion, the loss of cultivated land, and deforestation [
2]. However, different types of errors and uncertainties arise during the generation of land-cover data, namely in the acquisition, processing, classification, and analysis of the data [
3,
4,
5], and these errors directly affect the quality of the final product. Thus, an unbiased estimation of the accuracy of land-cover products is necessary.
In previous research studies, accuracy assessment has been used to validate land-cover products and provide the user with a better understanding of the product quality [
6]. The result of the accuracy assessment can also help producers improve the classifier, seek optimized classification features, or combine external data to improve classification accuracy [
7]. The main process of accuracy assessment is quantifying the spatial and attribute consistency between the classification product and the reference data [
6]. According to the principles of statistics, several representative sample points can be chosen in geographic space, and corresponding reference data that can reflect the ground truth are selected [
8]. The reference labels can then be visually interpreted from very high-resolution data, such as aerial photographs or field measurement data [
9]. The implementation of statistically rigorous accuracy assessment can, thus, be achieved based on good practices [
10] and sampling-based estimation, reflecting the consistency of the map classification and reference data [
11]. An error matrix is a cross-tabulation of map classification labels against reference data labels. Its rows represent the map classification labels, and its columns represent the reference data labels. The corresponding entries are then used to calculate overall accuracy, producer’s accuracy, and user’s accuracy, together with their standard deviations and confidence intervals (most commonly at the 95% or 90% level) [
12,
13,
14]. In a binary classification application, precision, recall, and F-score play important roles in calculating accuracies [
15].
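As a minimal sketch of how these measures follow from an error matrix, the Python snippet below assumes a matrix of raw sample counts with map labels as rows and reference labels as columns, as described above. The function names and the example counts are hypothetical, and the calculation ignores stratum area weights, so it corresponds to simple random sampling rather than to a stratified design.

```python
import numpy as np

def accuracy_from_error_matrix(matrix):
    """Overall, user's, and producer's accuracy from sample counts.

    Rows are map labels and columns are reference labels.
    Assumes unweighted counts (no stratum area weights).
    """
    m = np.asarray(matrix, dtype=float)
    correct = np.diag(m)
    overall = correct.sum() / m.sum()
    users = correct / m.sum(axis=1)      # correct / row totals (map class)
    producers = correct / m.sum(axis=0)  # correct / column totals (reference class)
    return overall, users, producers

def binary_scores(matrix, positive=0):
    """Precision, recall, and F-score for one class of a 2 x 2 error matrix."""
    _, users, producers = accuracy_from_error_matrix(matrix)
    precision = users[positive]    # equals user's accuracy of the class
    recall = producers[positive]   # equals producer's accuracy of the class
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: rows = map (urban, non-urban), columns = reference.
example = [[45, 5],
           [10, 40]]
overall, users, producers = accuracy_from_error_matrix(example)
```

For a binary map, precision and recall of a class coincide with its user’s and producer’s accuracy, which is why the second function simply reuses the first.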
Many past research studies were devoted to evaluating the accuracy of global land-cover products in a statistically rigorous manner [
10]. For example, Mayaux et al. [
16] conducted an assessment of the GLC2000 product by combining a confidence-building method and stratified random sampling, reporting an overall accuracy of 68.6%. Gong et al. [
17] produced and assessed the Finer Resolution Observation and Monitoring–Global Land Cover (FROM-GLC) product by dividing the globe into hexagons and selecting five random samples from each hexagon; the support vector machine classifier produced the highest overall classification accuracy of 64.9%. The global burned area MODIS-MCD45 product was verified by Padilla et al. [
18] in 2008; stratified random sampling was used to select 102 sample Thiessen scene areas, with the sample size allocated to strata based on burned-area extent, and both the global accuracy and the accuracy for some of the terrestrial biomes were estimated. The temporal consistency of long time-series maps is currently an area of focus [
19]. Liu et al. [
20] developed a global 30 m impervious surface map, and a total of 11,942 sample points were randomly selected in 15 typical verification areas to verify its accuracy; the overall accuracy was 95.1%, with a kappa coefficient of 0.898. The Committee on Earth Observing Satellites (CEOS) has endorsed several activities regarding the calibration and validation of satellite data and provided recommendations on the validation of change maps (i.e., the collection of new, high-quality, and multi-resolution reference data) in addition to providing spatially representative satellite measurements for validation [
21]. However, many users are primarily concerned with the spatial pattern of the land cover, and they are less concerned with the temporal dimension. Therefore, there is an urgent need to expand the accuracy estimation of multi-temporal land-cover products by validating both the spatial and temporal consistency.
Multi-temporal land-cover products cover the same spatial location and show time-series attribute changes [
22]. For the accuracy assessment of a single period, the cost and the time consumption can be very high. Single-period evaluation can only reflect the data quality of each period, whereas multi-temporal land-cover evaluation can provide both the accuracy of a single period and information on change. However, the complexity of the change increases significantly as more data are added in the temporal domain. It is therefore necessary to design a reasonable stratified sampling method that uses the changes in all periods.
During the sampling process, several representative samples within the region of interest are selected based on probability sampling theory, and this selection directly influences the accuracy estimates for remote sensing products. Among all sampling designs, stratified random sampling is widely used; it is defined as selecting a simple random sample from each stratum. Stratified sampling allows for different accuracies in different strata. SSCE estimates classification accuracy with high accuracy even when only a few sampling points exist, and it requires fewer sampling points than SS under the same error tolerance [
23].
However, allocating the samples to each stratum in several time periods for multi-temporal land-cover products is a challenging task. For multi-temporal land-cover products with only two categories, such as change and no change, urban and non-urban, forest and non-forest, etc., the area of no change is typically larger, and the area of change is smaller. If the sample size is allocated according to the area of land cover, it will lead to a smaller sample size for the rare type. Methods for reasonably allocating the sample size for this type of data product need to be studied.
Different studies propose different principles to determine the sample sizes of the different strata, considering different objectives as well as the specified standard deviation contribution [
24,
25]. One principle for determining the allocation of the sample size is based on empirical rules. The standard deviation of the estimated user’s accuracy of change decreases with equal allocation. The proportional allocation method depends on the area of the different map classes: more samples are selected within common classes with a large area, while fewer samples are selected for rare classes. This type of sample size allocation depends on empirical rules rather than on a mathematical model and therefore relies heavily on expert experience.
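The contrast between the two empirical rules can be illustrated with a small Python sketch. The stratum areas, the total sample size, and the assumed user’s accuracy of 0.7 are hypothetical values chosen only for illustration, and the standard error uses the simple binomial approximation sqrt(p(1 - p)/n).

```python
import math

def proportional_allocation(areas, n_total):
    """Allocate samples in proportion to stratum area (rounded to integers)."""
    total_area = sum(areas.values())
    return {name: round(n_total * area / total_area)
            for name, area in areas.items()}

def equal_allocation(areas, n_total):
    """Allocate the same number of samples to every stratum."""
    per_stratum = n_total // len(areas)
    return {name: per_stratum for name in areas}

def standard_error(p, n):
    """Binomial approximation of the standard error of an estimated proportion."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical two-stratum map: change covers 5% of the area.
areas = {"no change": 95.0, "change": 5.0}
prop = proportional_allocation(areas, 500)   # {'no change': 475, 'change': 25}
equal = equal_allocation(areas, 500)         # {'no change': 250, 'change': 250}

# Standard error of the user's accuracy of the rare change stratum,
# assuming a true accuracy of 0.7 (an illustrative value).
se_prop = standard_error(0.7, prop["change"])
se_equal = standard_error(0.7, equal["change"])
```

Under these assumptions the rare change stratum receives 25 samples with proportional allocation and 250 with equal allocation, so its user’s accuracy is estimated with a noticeably smaller standard error in the equal case. Rounding in the proportional rule may leave the total slightly different from the requested sample size in general.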
Land-cover types with a small proportion sometimes need more samples than those with a large proportion in order to achieve more reliable assessment results [
4]. The area of change is small in the multi-temporal land cover, but it is important for users. Only a few research studies consider the spatially stratified sampling designs and sample size allocation for rare strata. For some rare change strata of interest, the reasonable allocation of sample size cannot be obtained using the empirical model.
The other principle is based on the variance of the different estimators, e.g., the overall accuracy, user’s accuracy, and producer’s accuracy. Neyman allocation involves allocating different sample sizes by minimizing the variance of the estimated overall accuracy [
24]. Cochran [
25] utilized a minimum variance estimator to obtain the allocation of sample sizes using stratifications by considering the accuracy of both the area of the reference class and the overall accuracy. Stehman [
24] obtained the optimal allocation of sample sizes using an objective function established by the sum of three variances (producer’s accuracy, user’s accuracy, and area estimation of a single class) and analyzed various sample allocation schemes for the error matrix. Since the three types of variances are complementary, the result of minimizing the objective function using a single indicator is biased [
26]. Thus, the ideal sampling design should follow the criterion of high-precision estimation so that all the estimators have a small standard deviation if no special indicator is provided.
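For reference, the textbook form of Neyman allocation assigns each stratum a sample size proportional to W_h S_h, where W_h is the stratum weight and S_h the stratum standard deviation. The Python sketch below assumes that the quantity of interest is a proportion (so that S_h = sqrt(p_h(1 - p_h))) and that anticipated stratum accuracies p_h are available; the function names and example values are hypothetical, and this is not the objective function developed in this study.

```python
import math

def neyman_allocation(weights, accuracies, n_total):
    """Fractional sample sizes n_h proportional to W_h * S_h.

    weights    : stratum area proportions W_h (summing to 1)
    accuracies : anticipated stratum accuracies p_h (an assumption)
    """
    sds = [math.sqrt(p * (1 - p)) for p in accuracies]
    products = [w * s for w, s in zip(weights, sds)]
    total = sum(products)
    return [n_total * x / total for x in products]

def overall_accuracy_variance(weights, accuracies, sizes):
    """Variance of the stratified overall-accuracy estimator,
    ignoring the finite population correction."""
    return sum(w ** 2 * p * (1 - p) / n
               for w, p, n in zip(weights, accuracies, sizes))

# Hypothetical example: a large, accurately mapped stratum and a small,
# less accurately mapped one.
weights = [0.95, 0.05]
accuracies = [0.95, 0.70]
sizes = neyman_allocation(weights, accuracies, 500)
```

The returned sizes are fractional; rounding and a minimum sample size per stratum would have to be handled separately. Because the allocation targets only the variance of the overall accuracy, it says nothing about the precision of class-specific estimators, which is the limitation discussed above.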
In this study, we developed a new spatio-temporal stratified sampling method for estimating the spatial accuracy of global land-cover products by considering their spatio-temporal characteristics and optimizing the sample allocation for each stratum. In practical applications, the temporal and spatial characteristics, i.e., the types of land-cover change and no change across the acquisition periods, are used as the basis for stratified sampling. The sample units are spatially defined based on 30 m resolution pixels and temporally defined by the acquisition dates of the multi-temporal land-cover images. Because no product quality information is available before validating the results, the initial error matrix is obtained by interpreting a fraction of the sample, and the objective function is constructed to determine the stratified sample size by minimizing the sum of the user’s accuracy variance, producer’s accuracy variance, and estimated area ratio variance of all strata. Unlike previous studies [
24] on the allocation of sample size, the proposed algorithm does not single out a special stratum or a single class of primary interest; the accuracy estimators of all spatio-temporal strata are considered equally important. We tested the proposed method with the ShangHai (SH) dataset [
27]. In addition, the spatio-temporal stratified sampling and optimal sample allocation methods were applied to a multi-temporal global urban land-cover product [
28]. Due to the spatial clustering characteristic of urban areas, we adopted the local pivotal method (LPM) for selecting well-spread samples and for improving the efficiency of accuracy estimations.
The main contributions of this paper are listed as follows:
- (1) We propose a temporal stratification based on the combination of land-cover types on three different dates in order to achieve a reasonable stratified sampling.
- (2) An optimal sample allocation is proposed that optimizes three types of variances over all strata.
- (3) The proposed spatio-temporal stratified sampling is applied to the multi-temporal global urban land-cover dataset.
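The temporal stratification in contribution (1) can be sketched in Python as follows. The sketch assumes three co-registered binary maps in which 1 denotes urban and 0 non-urban, and it assumes that the digits of each stratum code are ordered by acquisition date; the function names and the tiny example arrays are hypothetical.

```python
import numpy as np

def spatiotemporal_strata(map_t1, map_t2, map_t3):
    """Combine three binary maps into integer stratum codes 0..7.

    Each code read as a 3-digit binary number gives the per-date labels,
    e.g. 3 -> '011' (non-urban on the first date, urban on the other two).
    """
    stack = np.stack([map_t1, map_t2, map_t3]).astype(int)
    return stack[0] * 4 + stack[1] * 2 + stack[2]

def code_label(code):
    """Format an integer stratum code as a three-character label."""
    return format(int(code), "03b")

# Hypothetical 2 x 2 maps for three dates (1 = urban, 0 = non-urban).
t1 = np.array([[0, 0], [0, 1]])
t2 = np.array([[0, 0], [1, 1]])
t3 = np.array([[0, 1], [1, 1]])

codes = spatiotemporal_strata(t1, t2, t3)
values, counts = np.unique(codes, return_counts=True)
stratum_sizes = {code_label(v): int(c) for v, c in zip(values, counts)}
```

With three dates and two classes there are at most eight such strata; the per-stratum pixel counts give the area proportions that a stratified design would use as weights.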
4. Discussion
In this article, we have proposed a spatio-temporal stratified sampling framework for multi-temporal global urban land-cover data, which aims to accurately evaluate both the accuracy of single-period data and the accuracy of two-period change and no-change types. In the sampling design, a probability sampling statistical model is used to calculate the sample size of the primary and secondary sampling units. The percentage sampling method has disadvantages, such as being too strict for large batches and too loose for small batches, and it is not well suited to determining the sample size. Many studies have not considered how to reasonably determine the sample size when evaluating the accuracy of land-cover products. However, the sample size allocation in the sample design component of accuracy assessment is an important part of improving the accuracy [
24]. In the proposed approach, the stratification is determined by global urban ecological regions and spatio-temporal changes, and a two-stage sampling framework is established. During first-stage sampling, a regional stratified random sampling design is used to allocate the samples to the strata according to the proportion of the urban area extent of the global urban land-cover product. Spatio-temporal change stratification provides the technical support for dynamic monitoring of the product. In the second stage, based on the characteristics of the multi-temporal urban land, a method for determining the stratified sample size with an objective function is proposed. Considering the spatial distribution characteristics of urban land cover, the sample pixels are selected by LPM. The proposed new accuracy validation methodology for multi-temporal global urban land-cover data could provide a technical reference for the subsequent accuracy evaluation of similar products.
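To illustrate how a well-spread sample can be drawn, the Python sketch below implements a simplified variant of the local pivotal method in which a randomly chosen undecided unit competes with its nearest undecided neighbour and the pair’s inclusion probabilities are updated by the pivotal rule. It is an assumption-laden illustration, not the implementation used in this study: the function name, the brute-force neighbour search, and the tolerance `eps` are all hypothetical choices.

```python
import numpy as np

def local_pivotal_sample(coords, probs, rng=None, eps=1e-9):
    """Select a spatially well-spread sample with given inclusion probabilities.

    coords : (N, d) array of unit coordinates
    probs  : length-N inclusion probabilities (their sum is the sample size)
    Returns the indices of the selected units.
    """
    rng = np.random.default_rng() if rng is None else rng
    coords = np.asarray(coords, dtype=float)
    p = np.asarray(probs, dtype=float).copy()
    undecided = [k for k in range(len(p)) if eps < p[k] < 1 - eps]

    while len(undecided) > 1:
        # Pick a random undecided unit and its nearest undecided neighbour.
        i = undecided[rng.integers(len(undecided))]
        others = [k for k in undecided if k != i]
        dists = np.linalg.norm(coords[others] - coords[i], axis=1)
        j = others[int(np.argmin(dists))]

        total = p[i] + p[j]
        if total < 1:
            # One unit is dropped; the other keeps the combined probability.
            if rng.random() < p[j] / total:
                p[i], p[j] = 0.0, total
            else:
                p[i], p[j] = total, 0.0
        else:
            # One unit is selected; the other keeps the remainder.
            if rng.random() < (1 - p[j]) / (2 - total):
                p[i], p[j] = 1.0, total - 1
            else:
                p[i], p[j] = total - 1, 1.0
        undecided = [k for k in undecided if eps < p[k] < 1 - eps]

    if undecided:
        # A single leftover unit (non-integer total) is decided at random.
        k = undecided[0]
        p[k] = 1.0 if rng.random() < p[k] else 0.0
    return np.flatnonzero(p > 1 - eps)

# Hypothetical usage: 40 pixels on a grid, each with inclusion probability 0.25.
xy = np.array([(x, y) for x in range(8) for y in range(5)], dtype=float)
sample = local_pivotal_sample(xy, np.full(40, 0.25), rng=np.random.default_rng(0))
```

Because each update preserves the expected inclusion probability of both units while forcing close neighbours to compete, nearby pixels are unlikely to be selected together, which suits spatially clustered classes such as urban land. The neighbour search here is quadratic in the number of units, so a spatial index would be needed for large populations.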
The accuracy estimations conducted based on a sample of reference data provided valuable information on both the single-date maps and multi-temporal global urban land-cover data for the change and no-change types. When only considering no-change types (000 and 111) and urban expansion (001 and 011) as the strata, it was found that the overall accuracy gradually decreased from 2000 to 2010 (
Figure 14a). The accuracy evaluation results for the eight strata considering spatio-temporal changes were highest in 2000 and lowest in 2010 (
Figure 10a). An explanation for this is that the proportion of land area occupied by the change strata is small, but it still has an impact on the result of the stratified accuracy assessment. From 2000 to 2010, the area of urban land increased, and its classification accuracy was lower than that of non-urban land, which in turn affected the overall accuracy of the product. In future research, a wider variety of accuracy indicators could be used to evaluate the accuracy of a single type of data product, beyond the overall accuracy, producer’s accuracy, and user’s accuracy considered in this study. In this paper, the multi-temporal land-cover accuracy assessment experiment uses only three periods of data for validation, and it is expected that more than three periods of data can be used to verify the method in the future. Assigning the sample size based on objective function optimization requires the variance of the data to be known in advance; however, this information is not available at the sampling design stage. In the future, the allocation could instead be developed using other map information, such as the area of each stratum in the classification map. Reasonable models for reducing the workload caused by pre-sampling will be developed.
There are many factors that influence the accuracy of visual image interpretation: inconsistency in the definition of urban land data [
49], misclassification in global urban land data, spatial misalignment between the classification map and reference data, different interpreters labeling the same sample with different reference labels [
50], and reference data not being free of errors [
51]. Validation procedures need to be improved in future studies.