Article

DESTformer: A Transformer Based on Explicit Seasonal–Trend Decomposition for Long-Term Series Forecasting

1 School of Mechanical Engineering, Dalian University of Technology, Dalian 116024, China
2 School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10505; https://doi.org/10.3390/app131810505
Submission received: 13 July 2023 / Revised: 12 September 2023 / Accepted: 15 September 2023 / Published: 20 September 2023

Abstract:
Seasonal–trend-decomposed transformers have empowered long-term time series forecasting by capturing global temporal dependencies (e.g., period-based dependencies) in disentangled temporal patterns. However, existing methods design various auto-correlation or attention mechanisms in the seasonal view while ignoring the fine-grained temporal patterns in the trend view of the series decomposition component, which causes an information utilization bottleneck. To this end, a Transformer-based seasonal–trend decomposition methodology with a multi-scale attention mechanism in the trend view and a multi-view attention mechanism in the seasonal view is proposed, called DESTformer. Specifically, rather than using a moving average operation to obtain trend data, a frequency domain transform is first applied to extract the seasonal (high-frequency) and trend (low-frequency) components, explicitly capturing different temporal patterns in both the seasonal and trend views. For the trend component, a multi-scale attention mechanism is designed to capture fine-grained sub-trends under different receptive fields. For the seasonal component, instead of a frequency-only attention mechanism, a multi-view frequency domain (i.e., frequency, amplitude, and phase) attention mechanism is designed to enhance the ability to capture complex periodic changes. Extensive experiments are conducted on six benchmark datasets covering five practical applications: energy, transportation, economics, weather, and disease. Compared to the state-of-the-art FEDformer, our model reduces MSE and MAE by averages of 6.5% and 3.7%, respectively. These experimental results verify the effectiveness of our method and point to a new way of handling trend and seasonal patterns in long-term time series forecasting tasks.

1. Introduction

Long-term time series prediction refers to predicting sequence changes over a longer horizon based on historical data, e.g., forecasting 24 future points or more, as indicated in Informer [1], Autoformer [2], and FEDformer [3]. It has widespread applications in fields such as electricity forecasting [1,2], traffic flow prediction [4,5,6], inventory control [7], and healthcare management [8,9]. For example, in the energy sector, long-term forecasting is used to optimize the operation and management of the power grid, improving energy efficiency and reliability. However, high nonlinearity, long-term temporal dependency, and entangled multi-scale temporal components (e.g., trend and seasonality) make long-term forecasting a very challenging task.
First, learning multi-scale temporal dependency is nontrivial. Guided by the idea of time series decomposition, long sequences can be decoupled into more expressive seasonal and trend components. In traditional transformer-based methods [10], such as Informer [1], Reformer [11], and Preformer [12], attention values are often calculated based on the position-aware data points from the original time series. Due to the high noise and complexity in long-term forecasting tasks, these methods often produce suboptimal results. The latest methods for constructing seasonal–trend attention are still limited to single-perspective frequency domain learning [3] and fixed-length subsequence learning [2]. We believe that multi-view seasonal attention learning and variable-length sub-trend attention learning can flexibly capture more potential information in long sequences, thereby improving the predictive and generalization capabilities of the model.
Second, on the basis of time series decomposition, effective learning of seasonal–trend representations becomes even more important [13]. Existing methods that combine time series decomposition with the transformer architecture [10], such as Autoformer [2] and FEDformer [3], have demonstrated strong predictive capabilities on datasets with strong seasonality and weak noise disturbances. Although these methods adopt progressive decomposition, they still cannot effectively distinguish between seasonal and trend representations. At the same time, they often focus on learning seasonal components and neglect long-term trend fluctuations, which substantially limits these methods. Therefore, we believe that effective and targeted seasonal–trend representation learning can maximize a model's long-sequence prediction capability.
To address the above challenges, a transformer model based on seasonal–trend decomposition is proposed, called DESTformer. First, DESTformer effectively decouples and denoises complex sequences through a frequency domain transform. Next, for the seasonal component, a multi-view attention mechanism (MVI-Attention) is proposed to replace the traditional self-attention mechanism for capturing complex periodic changes. MVI-Attention simultaneously calculates self-attention from three perspectives, i.e., frequency, amplitude, and phase, and then converts the result back to the time domain through the inverse discrete Fourier transform. For the trend component, a multi-scale attention mechanism (MSC-Attention) is proposed to replace the traditional self-attention mechanism for capturing sub-trends under different receptive fields. MSC-Attention extracts sub-trends through one-dimensional convolutions with multi-scale receptive fields and aggregates the sub-trends by calculating the correlation coefficients between sequences. Finally, with the fast Fourier transform and a sampling strategy, DESTformer reduces the computational cost of the transformer from quadratic to linear complexity.
To sum up, our main contributions are as follows:
  • A transformer architecture based on seasonal–trend decomposition is proposed that can effectively decouple complex long sequences and learn representations of seasonal and trend components in a targeted manner.
  • A multi-view attention mechanism (MVI-Attention) is proposed that can perform holistic modeling from multiple perspectives in the frequency domain to capture important periodic structures in time series.
  • A multi-scale attention mechanism (MSC-Attention) is proposed to enhance information utilization in the trend view via the modeling of variable-length sub-trends, thus learning information-rich trend representations.
  • Extensive experiments are conducted on six benchmark datasets in multiple domains (energy, transportation, economics, weather, and disease). Experimental results show that DESTformer improves over state-of-the-art methods by 6.0% and 4.8% in multivariate and univariate long-term time series prediction tasks, respectively.

2. Related Work

2.1. Long-Term Time Series Forecasting

Time series prediction tasks aim at forecasting future time series in the prediction window given historical time series data in the conditioning window. Long-term time series forecasting is characterized by the large length of the predicted series. Mainstream time series prediction models can be divided into traditional statistical methods and machine learning-based methods. Traditional statistical methods mainly include autoregressive methods, such as ARIMA [14,15], and additive models, such as Holt-Winters [16] and Prophet [17]. In particular, Holt-Winters [16] and Prophet [17] capture versatile temporal patterns (e.g., trends, seasonality, and randomness) to better model nonlinearities. These methods have strong explainability and are suitable for dealing with relatively stable and regular time series data. However, they suffer from several limitations, such as sensitivity to outliers and missing values, difficulty in dealing with nonlinear and complex time series data, and difficulty in integrating other relevant information, such as timestamp information. In particular, when applied to long-term forecasting tasks, the aforementioned statistical methods fail to capture reliable dependencies.
In recent years, transformers based on self-attention mechanisms [10] have shown powerful capabilities on sequential data, such as natural language processing [18], audio processing [19], and even computer vision [20]. However, in long-term forecasting tasks, applying self-attention mechanisms to long-term time series prediction is computationally expensive due to the quadratic complexity in memory and time with respect to the sequence length L. LogTrans [21] introduced local convolution into the transformer and proposed the LogSparse attention mechanism to select exponentially growing intervals of time steps, reducing the complexity to O(L log L). Reformer [11] used a locality-sensitive hashing (LSH) attention mechanism to reduce the complexity to O(L log L). Informer [1] extended the transformer with the ProbSparse self-attention mechanism based on KL-divergence to achieve O(L log L) complexity. Nevertheless, it is worth noting that these methods are based on the vanilla transformer and attempt to change the self-attention mechanism into sparse versions; they still follow the pointwise dependence modeling principle. Autoformer [2] decomposed complex time series into seasonal and trend components and used an auto-correlation mechanism to capture reliable temporal dependencies; FEDformer [3] adopted a similar decomposition idea and performed the self-attention calculation for the seasonal components in the frequency domain. In this paper, the frequency domain transform technique for disentangling seasonal and trend components is based on the inherent seasonality and trendiness of time series.

2.2. Time Series Decomposition

As a standard method for time series analysis, time series decomposition [22] decomposes a time series into several levels of representation, each of which represents a predictable underlying category, and has mainly been used to explore historical changes over time. For prediction tasks, decomposition is usually used as a preprocessing step on historical sequences before predicting future sequences [23,24,25], such as trend–seasonal decomposition in Prophet [26], basis expansion in N-BEATS [27], and matrix decomposition in DeepGLO [28]. However, such a preprocessing operation is limited to a simple decomposition of historical sequences and ignores the hierarchical interaction between the underlying patterns of long-term future sequences. CoST [29] utilized contrastive representation learning that is aware of seasonality and trends. LaST [13] decomposed seasonal–trend representations in latent space based on variational inference and supervised the separated representations from the perspective of self and input reconstruction to achieve optimal performance. However, traditional methods use the moving average operation to extract trend features, which results in weak robustness to noise. In this paper, we explore the idea of decomposition from a new perspective. Specifically, we map the time series to the frequency domain and then separate the high-frequency part as the seasonal component and the low-frequency part as the trend component through frequency domain masking. At the same time, for the high-frequency part, we filter the Top-K amplitudes and their corresponding frequencies to complete the denoising, making the decomposition more robust.

3. Methodology

In this section, a detailed description of the DESTformer architecture is provided. As mentioned earlier, long-term forecasting tasks involve complex temporal patterns in sequences. To effectively address this issue, a frequency domain transform module is used to decompose the original sequence into seasonal and trend components for modeling fine-grained temporal patterns. In addition, MVI-Attention and MSC-Attention are designed to capture the representations of seasonal and trend components respectively, thereby achieving accurate prediction.

3.1. Problem Definition

First, the problem definition for long-term forecasting is provided. Given a sequence $X_{1:T_x} = \{x_1, \ldots, x_{T_x} \mid x_t \in \mathbb{R}^K\}$ of length $T_x$ with $K$ variables, the goal is to predict a sequence $Y = \{y_1, \ldots, y_{T_y} \mid y_t \in \mathbb{R}^K\}$ of length $T_y$. A seasonal representation $H^S$ and a trend representation $H^T$ are learned for the desired predicted sequence $\hat{Y}$. Given the learned representations of the seasonal and trend parts, $P(Y \mid H^S, H^T)$ is ultimately modeled.

3.2. DESTformer Architecture

In this section, a detailed description of the overall architecture of DESTformer is provided, as shown in Figure 1. Combining the idea of time series decomposition, improvements are made to the transformer, including a frequency domain transform module, a multi-view attention mechanism, a multi-scale attention mechanism, and the corresponding encoder and decoder.

3.2.1. Frequency Decomposition

Compared to directly extracting seasonal and trend features from the original sequence (e.g., CoST [29]), decomposing first and then extracting targeted features can effectively reduce interference between features. This approach has been widely utilized in various data-denoising and pattern-filtering tasks [30,31]. In long-term forecasting problems, time series decomposition can help learn complex temporal representations. Unlike traditional methods that obtain trend components through fixed-window moving averages, a new time series decomposition method is used that maps the sequence to the frequency domain and takes the high frequencies as the seasonal component and the low frequencies as the trend component. Compared to traditional decomposition methods, the frequency domain transform is more effective [30,31]: it avoids the impact that outliers have on the trend component in traditional methods:
$$\hat{X}_S,\; X_T = \mathcal{F}^{-1}\big(\mathcal{F}(X)[:\xi],\; \mathcal{F}(X)[\xi:]\big),$$
where $\mathcal{F}$ denotes the FFT and $\mathcal{F}^{-1}$ is its inverse, $\mathcal{F}(X) \in \mathbb{R}^{I \times K}$ with $I = T_x/2 + 1$, and $\xi = 3$. At this point, the seasonal component $\hat{X}_S$ contains a large amount of noise. In long-term forecasting tasks, the presence of noise often reduces the generalization ability of the model. Therefore, the seasonal component is further denoised by selecting the Top-K frequencies with the largest amplitudes to obtain the final seasonal component $X_S$.
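To make this step concrete, the following is a minimal PyTorch sketch of the frequency-based split described above. It assumes a batch-first (batch, T, K) tensor layout; the function name frequency_decompose, the cutoff xi, and the top_k value are illustrative stand-ins for FreDecomp, ξ, and the Top-K filtering, not the authors' released code.

```python
import torch

def frequency_decompose(x: torch.Tensor, xi: int = 3, top_k: int = 8):
    """Split a series into seasonal (high-frequency) and trend (low-frequency) parts.

    x: tensor of shape (batch, T, K). Returns (seasonal, trend) of the same shape.
    """
    freq = torch.fft.rfft(x, dim=1)                      # (batch, T//2 + 1, K), complex

    # Low-frequency bins form the trend; high-frequency bins form the seasonal part.
    low, high = torch.zeros_like(freq), torch.zeros_like(freq)
    low[:, :xi], high[:, xi:] = freq[:, :xi], freq[:, xi:]

    # Denoise the seasonal part: keep only the Top-K amplitudes per channel.
    amp = high.abs()
    kth = amp.topk(top_k, dim=1).values[:, -1:, :]       # K-th largest amplitude per channel
    high = torch.where(amp >= kth, high, torch.zeros_like(high))

    trend = torch.fft.irfft(low, n=x.size(1), dim=1)
    seasonal = torch.fft.irfft(high, n=x.size(1), dim=1)
    return seasonal, trend
```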

3.2.2. Model Inputs

The input to the encoder is the past sequence $X$ of length $T_x$. Consistent with Autoformer, the sequence $X$ is decomposed into a seasonal component $X_S$ and a trend component $X_T$ through the frequency domain transform. The latter half of the seasonal component sequence is concatenated with a zero vector of length $T_y$ along the time dimension as the seasonal input to the decoder. The latter half of the trend component sequence is concatenated with a vector of the sequence mean of length $T_y$ along the time dimension as the trend input to the decoder. The encoder input is denoted as $X_{en}$, the seasonal decoder input as $X_{de}^{s}$, and the trend decoder input as $X_{de}^{t}$. Mathematically, we have
$$X_{en}^{s},\, X_{en}^{t} = \mathrm{FreDecomp}\big(X_{en,\, T_x/2\, :\, T_x}\big), \qquad X_{de}^{s} = \mathrm{Concat}(X_{en}^{s}, X_0), \qquad X_{de}^{t} = \mathrm{Concat}(X_{en}^{t}, X_{\mathrm{Mean}}).$$
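Under the same assumptions, the sketch below shows one way the decoder inputs could be assembled from the latter half of the encoder series, with zero padding for the seasonal branch and the series mean for the trend branch. It reuses the hypothetical frequency_decompose helper from the previous sketch.

```python
import torch

def build_decoder_inputs(x_en: torch.Tensor, t_y: int, xi: int = 3, top_k: int = 8):
    """Build seasonal/trend decoder inputs of length T_x/2 + T_y from x_en: (batch, T_x, K)."""
    t_x = x_en.size(1)
    x_half = x_en[:, t_x // 2 :, :]                      # latter half of the input series
    x_s, x_t = frequency_decompose(x_half, xi, top_k)    # hypothetical helper from above

    zeros = x_en.new_zeros(x_en.size(0), t_y, x_en.size(2))        # placeholder for seasonal part
    mean = x_en.mean(dim=1, keepdim=True).repeat(1, t_y, 1)        # series mean for trend part

    x_de_s = torch.cat([x_s, zeros], dim=1)
    x_de_t = torch.cat([x_t, mean], dim=1)
    return x_de_s, x_de_t
```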

3.2.3. Encoder

In the encoder, representation learning [13] and time series decomposition [27] are used to extract representations for the seasonal and trend components, respectively. Suppose there are $N$ encoder layers. The $l$-th encoder layer is represented as $X_{en}^{l} = \mathrm{Encoder}(X_{en}^{l-1})$. The detailed process of each layer is
$$\begin{aligned} S_{en}^{l,1} &= \mathrm{MVI\text{-}Attention}(X_{en}^{l-1}), \\ T_{en}^{l,1} &= \mathrm{MSC\text{-}Attention}(X_{en}^{l-1} - S_{en}^{l,1}), \\ S_{en}^{l,2} &= S_{en}^{l,1} + \mathrm{FeedForward}(S_{en}^{l,1}), \\ T_{en}^{l,2} &= T_{en}^{l,1} + \mathrm{FeedForward}(T_{en}^{l,1}), \\ X_{en}^{l} &= S_{en}^{l,2} + T_{en}^{l,2}, \end{aligned}$$
where $\mathrm{FeedForward}(\cdot)$ is composed of stacked one-dimensional convolutional and linear layers. In the last encoder layer, the seasonal representation $S_{en}^{l,2}$ and the trend representation $T_{en}^{l,2}$ are no longer added together but are directly input to the decoder. Detailed descriptions of MVI-Attention and MSC-Attention are provided in Section 3.3 and Section 3.4, respectively. They replace self-attention to extract the seasonal and trend representations, respectively.
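The encoder layer above can be sketched as follows, assuming MVI-Attention and MSC-Attention modules that map a (batch, T, d_model) tensor to the same shape; the pointwise Conv1d + GELU feed-forward is only a stand-in for the stacked one-dimensional convolutional and linear layers described above.

```python
import torch.nn as nn

class DESTEncoderLayer(nn.Module):
    """One encoder layer: seasonal attention, trend attention on the de-seasonalized
    residual, per-branch feed-forward blocks, then summation of the two branches."""

    def __init__(self, d_model: int, mvi_attention: nn.Module, msc_attention: nn.Module):
        super().__init__()
        self.mvi_attention = mvi_attention   # seasonal branch
        self.msc_attention = msc_attention   # trend branch
        self.ff_s = nn.Sequential(nn.Conv1d(d_model, d_model, 1), nn.GELU(),
                                  nn.Conv1d(d_model, d_model, 1))
        self.ff_t = nn.Sequential(nn.Conv1d(d_model, d_model, 1), nn.GELU(),
                                  nn.Conv1d(d_model, d_model, 1))

    def forward(self, x):                    # x: (batch, T, d_model)
        s1 = self.mvi_attention(x)           # seasonal representation of this layer
        t1 = self.msc_attention(x - s1)      # trend attention on the residual
        s2 = s1 + self.ff_s(s1.transpose(1, 2)).transpose(1, 2)
        t2 = t1 + self.ff_t(t1.transpose(1, 2)).transpose(1, 2)
        return s2 + t2                       # output passed to the next layer
```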

3.2.4. Decoder

In the decoder, a frequency domain transform module is introduced to further obtain clearer seasonal–trend representations. Suppose there are $M$ decoder layers. In each decoder layer, cross MVI-Attention and cross MSC-Attention are first used to fuse the output representation of the encoder with the input representation of the decoder. Then, two interactive frequency domain decomposition modules are used to further complete time series decomposition and aggregation. Finally, the seasonal–trend representations are learned through a feedforward network and residual connections. Formally, the $l$-th decoder layer is represented as $X_{de}^{l} = \mathrm{Decoder}(X_{de}^{l-1}, X_{en}^{N})$. The detailed process of each decoder layer is
$$\begin{aligned} S_{des}^{l,1},\, T_{des}^{l,1} &= \mathrm{FreDecomp}\big(\mathrm{MVI\text{-}Attention}(S_{de}^{l-1,2}, S_{en}^{N,2})\big), \\ S_{det}^{l,1},\, T_{det}^{l,1} &= \mathrm{FreDecomp}\big(\mathrm{MSC\text{-}Attention}(T_{de}^{l-1,2}, T_{en}^{N,2})\big), \\ S_{de}^{l,1} &= S_{des}^{l,1} + S_{det}^{l,1}, \\ T_{de}^{l,1} &= T_{des}^{l,1} + T_{det}^{l,1}, \\ S_{de}^{l,2} &= S_{de}^{l,1} + \mathrm{FeedForward}(S_{de}^{l,1}), \\ T_{de}^{l,2} &= T_{de}^{l,1} + \mathrm{FeedForward}(T_{de}^{l,1}), \end{aligned}$$
where $S_{de}^{0,2} = X_{de}^{s}$ and $T_{de}^{0,2} = X_{de}^{t}$. In contrast to the encoder, the seasonal and trend representations are only added in the last layer, $X_{de}^{M} = S_{de}^{M,2} + T_{de}^{M,2}$, and the final prediction is made through a linear layer in the decoder.
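A corresponding sketch of one decoder layer is given below, under the same interface assumptions: the cross-attention modules take a decoder query tensor and an encoder key/value tensor, frequency_decompose is the hypothetical helper from the Section 3.2.1 sketch, and plain linear layers stand in for the feed-forward blocks above.

```python
import torch.nn as nn

class DESTDecoderLayer(nn.Module):
    """One decoder layer: cross attention against the encoder output, frequency
    decomposition of each branch, cross-aggregation, and feed-forward blocks."""

    def __init__(self, d_model: int, mvi_cross: nn.Module, msc_cross: nn.Module):
        super().__init__()
        self.mvi_cross = mvi_cross           # cross MVI-Attention (query: decoder, key/value: encoder)
        self.msc_cross = msc_cross           # cross MSC-Attention
        self.ff_s = nn.Linear(d_model, d_model)
        self.ff_t = nn.Linear(d_model, d_model)

    def forward(self, s_de, t_de, s_en, t_en):
        s_s, t_s = frequency_decompose(self.mvi_cross(s_de, s_en))   # decompose seasonal branch
        s_t, t_t = frequency_decompose(self.msc_cross(t_de, t_en))   # decompose trend branch
        s1, t1 = s_s + s_t, t_s + t_t                                # aggregate across branches
        s2 = s1 + self.ff_s(s1)
        t2 = t1 + self.ff_t(t1)
        return s2, t2
```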

3.3. Multi-View Attention

Extracting periodic fluctuations from the seasonal component sequence is particularly important. Through the FFT [32], the seasonal component is mapped to the frequency domain, represented by complex values. After mapping the time series to the frequency domain, a time series can be completely represented by three attributes (frequency, amplitude, and phase), which reflect different characteristics of the periodic fluctuations of the sequence. Amplitude usually reflects the maximum distance that a sequence deviates from its equilibrium position at a certain moment, while phase represents the different states of a periodic sequence at different moments. By mapping the learned seasonal representation from the time dimension to the frequency domain, $F \in \mathbb{R}^{I \times D}$ is obtained, where $D$ represents the feature dimension. The real and imaginary parts of $F$ are denoted $F_r$ and $F_i$, respectively. Amplitude and phase are represented by $A(\cdot)$ and $\Phi(\cdot)$, respectively, and can be described as
$$A(F) = \sqrt{F_r^2 + F_i^2}, \qquad \Phi(F) = \arctan\!\left(\frac{F_i}{F_r}\right).$$
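As a quick illustration of these two quantities on a toy tensor (the shapes below are arbitrary and purely for demonstration):

```python
import torch

x = torch.randn(4, 96, 8)                            # toy seasonal component: (batch, T, D)
F_ = torch.fft.rfft(x, dim=1)                        # complex spectrum of shape (4, 49, 8)

amplitude = torch.sqrt(F_.real ** 2 + F_.imag ** 2)  # A(F); equivalent to F_.abs()
phase = torch.atan2(F_.imag, F_.real)                # Phi(F); equivalent to F_.angle()
```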
The inputs to MVI-Attention, i.e., the queries, keys, and values, are denoted as $q_s \in \mathbb{R}^{T_x \times D}$, $k_s \in \mathbb{R}^{T_x \times D}$, and $v_s \in \mathbb{R}^{T_x \times D}$, respectively. They are mapped to the frequency domain and combined with the sampling strategy:
$$Q_s,\, K_s,\, V_s = \mathrm{Select}\big(\mathcal{F}(q_s, k_s, v_s)\big).$$
Further, the amplitude and phase representations corresponding to $Q_s$, $K_s$, and $V_s$ are obtained, and similar attention mechanisms are applied in the three views, namely frequency, amplitude, and phase. Formally,
$$f = \sigma(Q_s \cdot K_s) \cdot V_s, \qquad \mathcal{A} = \sigma\big(A(Q_s) \cdot A(K_s)\big) \cdot A(V_s), \qquad \phi = \sigma\big(\Phi(Q_s) \cdot \Phi(K_s)\big) \cdot \Phi(V_s),$$
where σ is the activation function, such as Softmax or tanh. Finally, iDFT is applied to obtain the seasonal representation:
$$\mathrm{MVI\text{-}Attention}(q_s, k_s, v_s) = \sum_{i=1}^{I} \mathcal{A}_i \big[\cos(2\pi f_i j + \phi_i) + \cos(2\pi \bar{f}_i j + \bar{\phi}_i)\big],$$
where $\mathcal{A}_i$ and $\phi_i$ represent the amplitude and phase at the $i$-th frequency, respectively, and $\bar{f}_i$ and $\bar{\phi}_i$ represent the corresponding conjugate frequency and phase.
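Putting the pieces together, below is a loose, self-contained reading of MVI-Attention in PyTorch. The random frequency sampling stands in for the Select step, the spectrum attention is split over real and imaginary parts, and the final amplitude/phase recombination plays the role of the inverse transform above; none of these specific choices are confirmed by the paper, so treat this as one plausible interpretation rather than the authors' implementation.

```python
import torch

def mvi_attention(q_s: torch.Tensor, k_s: torch.Tensor, v_s: torch.Tensor, n_sample: int = 64):
    """Multi-view frequency-domain attention over frequency, amplitude, and phase views."""
    T = q_s.size(1)
    Q, K, V = (torch.fft.rfft(x, dim=1) for x in (q_s, k_s, v_s))

    # Sample a subset of frequency bins (stand-in for the paper's Select step).
    idx = torch.randperm(Q.size(1), device=q_s.device)[: min(n_sample, Q.size(1))]
    Q, K, V = Q[:, idx], K[:, idx], V[:, idx]

    def view_attention(q, k, v):
        # Plain dot-product attention over the frequency axis of one view.
        scores = torch.softmax(torch.einsum("bfd,bgd->bfg", q, k), dim=-1)
        return torch.einsum("bfg,bgd->bfd", scores, v)

    # Three views: raw spectrum (frequency), amplitude, and phase.
    f_view = view_attention(Q.real, K.real, V.real) + 1j * view_attention(Q.imag, K.imag, V.imag)
    a_view = view_attention(Q.abs(), K.abs(), V.abs())
    p_view = view_attention(Q.angle(), K.angle(), V.angle())

    # Recombine amplitude/phase into a complex spectrum, add the frequency view,
    # and map back to the time domain (the role of the iDFT in the text above).
    spec = torch.zeros(q_s.size(0), T // 2 + 1, q_s.size(2), dtype=torch.cfloat, device=q_s.device)
    spec[:, idx] = f_view + a_view * torch.exp(1j * p_view)
    return torch.fft.irfft(spec, n=T, dim=1)
```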

3.4. Multi-Scale Attention

For the trend component, sub-trends with different receptive fields often have a significant impact on future trends. Existing research is often limited to learning fixed-length trends, in which case choosing an appropriate lookback window becomes a critical issue: a small window can lead to underfitting, while a large window can lead to overfitting. A direct solution is to optimize this hyperparameter through grid search [33], but this is computationally expensive. Therefore, a multi-scale autoregressive mixture is used to adaptively capture sub-trends with different receptive fields. The size of the $j$-th convolutional kernel is denoted $g_j$. The inputs to MSC-Attention, i.e., the queries, keys, and values, are denoted as $q_t \in \mathbb{R}^{T_x \times D}$, $k_t \in \mathbb{R}^{T_x \times D}$, and $v_t \in \mathbb{R}^{T_x \times D}$, respectively. Unlike traditional self-attention mechanisms, a mean convolutional kernel $g_q = \frac{1}{J}\sum_{j=1}^{J} g_j$ is used to obtain the query vector $Q_t$:
$$Q_t = \mathrm{CausalConv}(q_t, g_q).$$
Further, $K_t$ and $V_t$ can be represented as
$$K_t^{j} = \mathrm{CausalConv}_k(k_t, g_j), \qquad V_t^{j} = \mathrm{CausalConv}_v(v_t, g_j).$$
After capturing the sub-trends with different receptive fields, the trend representation is generated as
$$\mathrm{MSC\text{-}Attention}(q_t, k_t, v_t) = \mathrm{softmax}(Q_t \cdot K_t) \cdot V_t,$$
where softmax is used as the activation function.
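A sketch of MSC-Attention along these lines is given below, using the kernel sizes {2, 4, 8, 16, 32, 64} mentioned in Section 4.1.4. The causal left-padding, the unscaled softmax(Q·K)·V per kernel size, and the averaging over sub-trends are simplifications made for this sketch; the paper's correlation-coefficient-based aggregation is not reproduced exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCAttention(nn.Module):
    """Multi-scale attention for the trend branch: causal 1-D convolutions with
    different kernel sizes extract sub-trends that serve as keys/values."""

    def __init__(self, d_model: int, kernel_sizes=(2, 4, 8, 16, 32, 64)):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        self.conv_k = nn.ModuleList([nn.Conv1d(d_model, d_model, g) for g in kernel_sizes])
        self.conv_v = nn.ModuleList([nn.Conv1d(d_model, d_model, g) for g in kernel_sizes])
        g_q = int(sum(kernel_sizes) / len(kernel_sizes))          # mean kernel size for the query
        self.conv_q = nn.Conv1d(d_model, d_model, g_q)

    @staticmethod
    def causal_conv(x, conv):
        # Left-pad so each output step only sees past inputs (causal convolution).
        pad = conv.kernel_size[0] - 1
        return conv(F.pad(x.transpose(1, 2), (pad, 0))).transpose(1, 2)

    def forward(self, q_t, kv_t=None):
        kv_t = q_t if kv_t is None else kv_t              # self- or cross-attention
        Q = self.causal_conv(q_t, self.conv_q)            # (batch, T, d_model)
        out = 0.0
        for conv_k, conv_v in zip(self.conv_k, self.conv_v):
            K = self.causal_conv(kv_t, conv_k)            # sub-trend at one receptive field
            V = self.causal_conv(kv_t, conv_v)
            scores = torch.softmax(Q @ K.transpose(1, 2), dim=-1)
            out = out + scores @ V                        # aggregate this sub-trend
        return out / len(self.kernel_sizes)
```

In the encoder, such a module would be called with a single argument (self-attention on the de-seasonalized residual); in the decoder, the encoder's trend output would be passed as kv_t for cross attention.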
The entire process of the algorithm is summarized in Algorithm 1.

3.5. Complexity Analysis

In DESTformer, MVI-Attention is used to capture the periodic fluctuations of the seasonal term, while MSC-Attention is used to capture the long-term changes of the trend term. For a sequence of length $L$, in MVI-Attention, the FFT is used to effectively reduce the time complexity to $O(L \log L)$ [2]. On this basis, we also leverage a sampling strategy to further reduce the time and memory complexity to $O(L)$ [3]. In MSC-Attention, we use one-dimensional convolutions to extract $J$ different sub-trends. Since the time complexity of one-dimensional convolution when encoding a time series is $O(L)$ and $J$ is a constant, the final time complexity of MSC-Attention is $O(J^2 L) = O(L)$, and the memory complexity of MSC-Attention combined with the sampling strategy is also $O(L)$. In summary, DESTformer achieves $O(L)$ time and memory complexity. In Table 1, we summarize the comparison of time complexity and memory usage for the training and inference steps.
Algorithm 1 Overall DESTformer procedure.
  • Input: time series $X$ in the conditioning window with length $T_x$, prediction window length $T_y$, data dimension $K$, hidden state embedding length $D$, the number of encoder layers $N$, the number of decoder layers $M$.
  • Output: predicted results $\hat{Y}$
 1: Use the uniform distribution to initialize the model parameters $\theta \sim U(-1, 1)$.
 2: $X_{en}^{0} = X$
 3: $X_{en}^{s}, X_{en}^{t} = \mathrm{FreDecomp}(X_{en,\, T_x/2\,:\,T_x})$
 4: $X_{de}^{s} = \mathrm{Concat}(X_{en}^{s}, X_0)$
 5: $X_{de}^{t} = \mathrm{Concat}(X_{en}^{t}, X_{\mathrm{Mean}})$
 6: /* Encoder */
 7: for $l = 1, 2, \ldots, N$ do
 8:   $S_{en}^{l,1} = \mathrm{MVI\text{-}Attention}(X_{en}^{l-1})$
 9:   $T_{en}^{l,1} = \mathrm{MSC\text{-}Attention}(X_{en}^{l-1} - S_{en}^{l,1})$
10:   $S_{en}^{l,2} = S_{en}^{l,1} + \mathrm{FeedForward}(S_{en}^{l,1})$
11:   $T_{en}^{l,2} = T_{en}^{l,1} + \mathrm{FeedForward}(T_{en}^{l,1})$
12:   $X_{en}^{l} = S_{en}^{l,2} + T_{en}^{l,2}$
13: end for
14: $S_{de}^{0,2}, T_{de}^{0,2} = X_{de}^{s}, X_{de}^{t}$
15: /* Decoder */
16: for $l = 1, 2, \ldots, M$ do
17:   $S_{des}^{l,1}, T_{des}^{l,1} = \mathrm{FreDecomp}(\mathrm{MVI\text{-}Attention}(S_{de}^{l-1,2}, S_{en}^{N,2}))$
18:   $S_{det}^{l,1}, T_{det}^{l,1} = \mathrm{FreDecomp}(\mathrm{MSC\text{-}Attention}(T_{de}^{l-1,2}, T_{en}^{N,2}))$
19:   $S_{de}^{l,1} = S_{des}^{l,1} + S_{det}^{l,1}$
20:   $T_{de}^{l,1} = T_{des}^{l,1} + T_{det}^{l,1}$
21:   $S_{de}^{l,2} = S_{de}^{l,1} + \mathrm{FeedForward}(S_{de}^{l,1})$
22:   $T_{de}^{l,2} = T_{de}^{l,1} + \mathrm{FeedForward}(T_{de}^{l,1})$
23: end for
24: $X_{de}^{M} = S_{de}^{M,2} + T_{de}^{M,2}$
25: $\hat{Y} = \mathrm{MLP}(X_{de}^{M})$

4. Experiments

To evaluate the proposed DESTformer model, a series of experiments is designed to compare it with state-of-the-art methods for long-term forecasting. In addition, ablation studies are conducted to investigate the roles and effects of each module in the model. Finally, an efficiency analysis and t-SNE [35] representation visualization experiments are performed.

4.1. Experimental Settings

4.1.1. Datasets

To validate the long-term forecasting capability of DESTformer, experiments are conducted on six real-world datasets. (1) The ETT (Electricity Transformer Temperature) dataset (https://github.com/zhouhaoyi/ETDataset (accessed on 12 July 2023)): This dataset is commonly used for long sequence time series prediction and contains data from two different regions in China, recorded at 1 h and 15 min intervals from July 2016 to July 2018. Each data point includes the oil temperature and six electricity load indicators. (2) The Electricity dataset (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (accessed on 12 July 2023)): This dataset contains hourly electricity consumption data for 321 customers between 2012 and 2014. (3) The Exchange dataset [36]: This dataset contains daily exchange rates for eight different countries from 1990 to 2016. (4) The Traffic dataset (http://pems.dot.ca.gov (accessed on 12 July 2023)): This dataset is a collection of hourly data from the California Department of Transportation and includes road occupancy rates measured by different sensors on highways in the San Francisco Bay Area. (5) The Weather dataset (https://www.bgc-jena.mpg.de/wetter/ (accessed on 12 July 2023)): This dataset contains 21 meteorological indicators recorded every 10 min throughout the year 2020, including temperature and humidity. (6) The ILI dataset (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html (accessed on 12 July 2023)): This dataset contains weekly data on influenza-like illness (ILI) patients recorded by the Centers for Disease Control and Prevention in the United States from 2002 to 2021, describing the proportion of ILI patients to the total number of patients.

4.1.2. Baselines

DESTformer is compared with five state-of-the-art long-term forecasting methods, including ARIMA [37], Informer [1], and LogTrans [21], as well as FEDformer [3] and Autoformer [2], which combine time series decomposition with the transformer.

4.1.3. Evaluation Metrics

For the six long-term time series forecasting tasks, we choose MAE and MSE to evaluate the prediction performance of various models. MAE and MSE are calculated as
$$\mathrm{MAE} = \frac{1}{T_y}\sum_{i=1}^{T_y} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{T_y}\sum_{i=1}^{T_y} \left( y_i - \hat{y}_i \right)^2.$$
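For reference, a direct PyTorch transcription of these two metrics (assuming y and y_hat are tensors over the prediction window):

```python
import torch

def mae(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    # Mean absolute error over the prediction window.
    return (y - y_hat).abs().mean()

def mse(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    # Mean squared error over the prediction window.
    return ((y - y_hat) ** 2).mean()
```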

4.1.4. Implementation Details

The method is optimized using the Adam [38] optimizer. For all methods, the learning rate is set to 0.00001, and the batch size is set to 32. The model is trained using the L2 loss, and early stopping within 20 epochs is applied during training. All experiments are repeated five times with different random seeds, and the final results are reported as the average of the metrics. The code is implemented in PyTorch [39]. The training/validation/test data are split in a 6/2/2 ratio, consistent with Informer. The convolutional kernel sizes of MSC-Attention are selected from {2, 4, 8, 16, 32, 64}. DESTformer consists of two encoder layers and one decoder layer. All models are trained and tested on an NVIDIA Tesla V100 32 GB GPU.

4.2. Main Results

4.2.1. Multivariate Forecasting Results

According to the experimental results on multivariate forecasting tasks, as shown in Table 2, DESTformer performs best under all prediction length settings across all benchmarks. This verifies the effectiveness of explicit seasonal and trend decomposition via high-frequency filtering (for the seasonal data) and low-frequency filtering (for the trend data), followed by the multi-view and multi-scale attention mechanisms in the seasonal and trend views. Note that when the input length is set to 96 and the prediction length is set to 336, the MSE of DESTformer decreases by 6.4%, 5.2%, 6.0%, 3.4%, and 8.9% on the ETT, Electricity, Exchange, Traffic, and Weather datasets, respectively. Finally, on the relatively special ILI dataset, when the input length is set to 36 and the prediction length is set to 60, DESTformer reduces the MSE by 4.4%. Overall, under the above experimental settings, the average MSE of DESTformer decreases by 5.7%. It can also be clearly seen that DESTformer performs well even on the Exchange dataset, which exhibits no obvious periodicity. In addition, as the prediction length increases, the performance of DESTformer remains relatively stable, indicating better long-term stability. This is meaningful for real-world applications, such as weather warnings and long-term energy consumption planning.
To better illustrate the predictive performance of DESTformer, we visualize the predicted sequences and their corresponding true sequences on six datasets. As shown in Figure 2, DESTformer can capture the long-term temporal patterns and accurately fit the fluctuations of future long sequences across different tasks.

4.2.2. Univariate Forecasting Results

As shown in Table 3, the univariate experimental results are listed. Compared with the baseline models, DESTformer still achieves the best performance in long-term prediction tasks. In particular, when the input length is set to 96 and the prediction length is set to 336, DESTformer achieves an MSE reduction of 1.5% on the ETT dataset, which has obvious periodicity. On the Exchange dataset, which has no obvious periodicity, DESTformer outperforms the other baselines by 4.8%, indicating stronger long-term prediction capability.

4.3. Ablation Studies

According to the experimental results, it is preliminarily analyzed that the excellent performance of the DESTformer is attributed to the decomposition of time series into seasonal and trend terms through frequency domain decomposition techniques. For the seasonal part, a multi-view attention mechanism is designed to capture the complex periodicity of seasonal changes. For the trend part, a multi-scale attention mechanism is designed to capture sub-trends under different receptive fields and learn the long-term regularity of trend changes. Ablation experiments are specifically designed to verify the roles of the frequency domain decomposition module, MVI-Attention, and MSC-Attention in the model.

4.3.1. Traditional Time Series Decomposition vs. Frequency Decomposition

To investigate the difference between the frequency domain decomposition technique applied in DESTformer and the traditional method of decomposing sequence data into seasonal and trend terms, the frequency domain decomposition technique in the original DESTformer is replaced with a common time series decomposition method; this version of the model is denoted as DESTformer-f. The experimental results are shown in Table 4: with the proposed frequency domain decomposition technique, DESTformer reduces the MSE by an average of 1.21% compared to DESTformer-f across all prediction length settings on the experimental datasets. Therefore, it is believed that the frequency domain decomposition technique can capture more spatiotemporal information in the frequency domain dimension of the sequence data for the downstream model to learn.

4.3.2. Self-Attention Family vs. MVI-Attention

To investigate the difference between the multi-view attention mechanism proposed in DESTformer and the traditional attention mechanism, a second ablation experiment is set up, in which the multi-view attention mechanism used for the seasonal term is replaced with a common attention mechanism; this version of the model is denoted as DESTformer-s. As shown in Table 4, the multi-view attention mechanism outperforms the traditional attention mechanism. Notably, when the prediction length is short, the multi-view attention mechanism reduces the MSE by a larger margin (6.4%). Therefore, it is believed that the multi-view attention mechanism can better extract seasonal information from the sequence data and retain as much of it as possible for prediction.

4.3.3. Self-Attention vs. MSC-Attention

To investigate the difference between the multi-scale attention mechanism proposed in DESTformer and the traditional attention mechanism, a third ablation experiment is set up, in which the multi-scale attention mechanism used for the trend term is replaced with a traditional attention mechanism; this version of the model is named DESTformer-t. Table 4 shows the experimental results. The proposed multi-scale attention mechanism also outperforms the traditional attention mechanism, with smaller MSEs in the experiments. In particular, the advantage of the multi-scale attention mechanism is more pronounced when the prediction length is large. Therefore, it is believed that the multi-scale attention mechanism helps the model better learn trend information and thus achieve better performance in the prediction tasks.

4.4. Efficiency Analysis

To show the efficiency of DESTformer with the multi-view attention mechanism in the seasonal domain and multi-scale mixed attention mechanism in the trend domain, we compare the memory cost and time cost in the training process with state-of-the-art models, including Informer, Autoformer, and FEDformer. As shown in Figure 3, DESTformer achieves O(L) complexity in both time and space efficiency. In addition, it shows superior efficiency in long-term time series forecasting tasks.

4.5. Representation Disentanglement

We perform a representation visualization analysis to validate the effectiveness of the representation learning in our model. The experiments are conducted on the ETTh1 dataset with an input sequence length of 96 and a prediction length of 24. The learned seasonal and trend representations are visualized using the t-SNE technique. As shown in Figure 4, compared with Autoformer [2] and FEDformer [3], the clustering of both the seasonal and trend representations is more obvious, and there is a clear boundary between the seasonal and trend representations learned by DESTformer. We argue that such representations, learned by the proposed multi-view and multi-scale attention mechanisms in the seasonal and trend domains, lead to the improved performance in the forecasting tasks.

5. Conclusions

In this paper, we propose an explicit seasonal–trend decomposed transformer, called DESTformer, for long-term forecasting. DESTformer first explicitly extracts the seasonal and trend components via high- and low-frequency filtering of the data after the frequency transform. To enhance information utilization, a multi-scale attention mechanism in the trend domain and a multi-view attention mechanism in the seasonal (frequency) domain are proposed, capturing fine-grained sub-trends under different receptive fields and complex periodic changes, respectively. Experimental results verify the effectiveness of our method, providing a new approach for handling trend and seasonal patterns in long-term time series prediction tasks. Despite the outstanding performance of DESTformer in long-term forecasting tasks, there are still some limitations. First, the effect of the multi-scale attention mechanism is influenced by the sub-trend selection method, and determining a suitable set of sub-trends adds extra workload to model training. Second, we only conducted experiments on datasets with obvious periodicity, and we hope to further test our model on more complex (even non-stationary) tasks in the future.

Author Contributions

Conceptualization, R.K.; Methodology, Y.W. and J.Z.; Project administration, Y.W. and R.K.; Software, J.Z.; Supervision, R.K.; Validation, Y.W.; Visualization, J.Z.; Writing—original draft, Y.W. and J.Z.; Writing—review and editing, R.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Defense Basic Scientific Research Program of China (No. JCKY2020212B003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  2. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  3. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
  4. Guo, C.; Li, D.; Chen, X. Unequal Interval Dynamic Traffic Flow Prediction with Singular Point Detection. Appl. Sci. 2023, 13, 8973. [Google Scholar] [CrossRef]
  5. Han, L.; Du, B.; Sun, L.; Fu, Y.; Lv, Y.; Xiong, H. Dynamic and multi-faceted spatio-temporal deep learning for traffic speed forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 13 February–23 April 2021; pp. 547–555. [Google Scholar]
  6. He, Z.; Zhao, C.; Huang, Y. Multivariate Time Series Deep Spatiotemporal Forecasting with Graph Neural Network. Appl. Sci. 2022, 12, 5731. [Google Scholar] [CrossRef]
  7. Qin, H.; Ke, S.; Yang, X.; Xu, H.; Zhan, X.; Zheng, Y. Robust spatio-temporal purchase prediction via deep meta learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 4312–4319. [Google Scholar]
  8. An, Y.; Zhang, L.; Yang, H.; Sun, L.; Jin, B.; Liu, C.; Yu, R.; Wei, X. Prediction of treatment medicines with dual adaptive sequential networks. IEEE Trans. Knowl. Data Eng. 2021, 34, 5496–5509. [Google Scholar] [CrossRef]
  9. Zhu, J.; Tang, H.; Zhang, L.; Jin, B.; Xu, Y.; Wei, X. A Global View-Guided Autoregressive Residual Network for Irregular Time Series Classification. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Osaka, Japan, 25–28 May 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 289–300. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  11. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
  12. Du, D.; Su, B.; Wei, Z. Preformer: Predictive transformer with multi-scale segment-wise correlations for long-term time series forecasting. arXiv 2022, arXiv:2202.11356. [Google Scholar]
  13. Wang, Z.; Xu, X.; Zhang, W.; Trajcevski, G.; Zhong, T.; Zhou, F. Learning Latent Seasonal-Trend Representations for Time Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  14. Box, G.E.; Jenkins, G.M. Some recent advances in forecasting and control. J. R. Stat. Society. Ser. C Appl. Stat. 1968, 17, 91–109. [Google Scholar] [CrossRef]
  15. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  16. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. Etsformer: Exponential smoothing transformers for time-series forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar]
  17. Toharudin, T.; Pontoh, R.S.; Caraka, R.E.; Zahroh, S.; Lee, Y.; Chen, R.C. Employing long short-term memory and Facebook prophet model in air temperature forecasting. Commun.-Stat.-Simul. Comput. 2020, 52, 279–290. [Google Scholar] [CrossRef]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  19. Chen, W.; Xing, X.; Xu, X.; Pang, J.; Du, L. SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 775–788. [Google Scholar] [CrossRef]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  22. Cleveland, R.B.; Cleveland, W.S.; McRae, J.E.; Terpenning, I. STL: A seasonal-trend decomposition. J. Off. Stat 1990, 6, 3–73. [Google Scholar]
  23. Jarrah, M.; Derbali, M. Predicting Saudi Stock Market Index by Using Multivariate Time Series Based on Deep Learning. Appl. Sci. 2023, 13, 8356. [Google Scholar] [CrossRef]
  24. Asadi, R.; Regan, A.C. A spatio-temporal decomposition based deep neural network for time series forecasting. Appl. Soft Comput. 2020, 87, 105963. [Google Scholar] [CrossRef]
  25. Ju, J.; Liu, F.A. Multivariate time series data prediction based on att-lstm network. Appl. Sci. 2021, 11, 9373. [Google Scholar] [CrossRef]
  26. Taylor, S.J.; Letham, B. Forecasting at scale. Am. Stat. 2018, 72, 37–45. [Google Scholar] [CrossRef]
  27. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437. [Google Scholar]
  28. Sen, R.; Yu, H.F.; Dhillon, I.S. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  29. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. arXiv 2022, arXiv:2202.01575. [Google Scholar]
  30. Gao, J.; Sultan, H.; Hu, J.; Tung, W.W. Denoising nonlinear time series by adaptive filtering and wavelet shrinkage: A comparison. IEEE Signal Process. Lett. 2010, 17, 237–240. [Google Scholar]
  31. Gao, J.; Hu, J.; Tung, W.w. Facilitating joint chaos and fractal analysis of biosignals through nonlinear adaptive filtering. PLoS ONE 2011, 6, e24331. [Google Scholar] [CrossRef]
  32. Wiener, N. Generalized harmonic analysis. Acta Math. 1930, 55, 117–258. [Google Scholar] [CrossRef]
  33. Hyndman, R.J.; Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Softw. 2008, 27, 1–22. [Google Scholar] [CrossRef]
  34. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  35. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9. [Google Scholar]
  36. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long-and short-term temporal patterns with deep neural networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104. [Google Scholar]
  37. Ariyo, A.A.; Adewumi, A.O.; Ayo, C.K. Stock Price Prediction Using the ARIMA Model. In Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, Cambridge, UK, 26–28 March 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 106–112. [Google Scholar]
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  39. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
Figure 1. DESTformer architecture. The encoder combines traditional STL decomposition ideas to achieve separation and modeling of seasonal and trend components through representation learning of the original sequence. The decoder adopts an innovative frequency domain decomposition and representation learning method to further optimize and enhance the seasonal–trend representation.
Figure 2. Visualization analysis of the predicted sequences (red) and true sequences (blue) of DESTformer on six different datasets.
Figure 3. Efficiency analysis of the two special attention mechanisms proposed in this article with the DESTformer under the same experimental setup as Autoformer.
Figure 4. Visualizations of seasonal (red) and trend (blue) representations on ETTh1 dataset.
Table 1. Complexity analysis of different forecasting models.

| Methods          | Training Time | Training Memory | Testing Steps |
|------------------|---------------|-----------------|---------------|
| LSTM [34]        | O(L)          | O(L)            | L             |
| Transformer [10] | O(L²)         | O(L²)           | L             |
| LogTrans [21]    | O(L log L)    | O(L²)           | 1             |
| Informer [1]     | O(L log L)    | O(L log L)      | 1             |
| Autoformer [2]   | O(L log L)    | O(L log L)      | 1             |
| FEDformer [3]    | O(L)          | O(L)            | 1             |
| DESTformer       | O(L)          | O(L)            | 1             |
Table 2. Multivariate results with different prediction lengths O ∈ {96, 192, 336, 720}. We set the input length I as 36 for ILI and 96 for the others. A lower MSE or MAE indicates a better prediction. Each entry is reported as MSE / MAE.

| Dataset | O | DESTformer | FEDformer [3] | Autoformer [2] | Informer [1] | LogTrans [21] | ARIMA [37] | LSTM [34] |
|---|---|---|---|---|---|---|---|---|
| ETT* | 96 | 0.198 / 0.266 | 0.203 / 0.287 | 0.255 / 0.339 | 0.365 / 0.453 | 0.768 / 0.642 | 1.354 / 0.829 | 2.041 / 1.073 |
| ETT* | 192 | 0.255 / 0.317 | 0.269 / 0.328 | 0.281 / 0.340 | 0.533 / 0.563 | 0.989 / 0.757 | 1.562 / 0.986 | 2.249 / 1.112 |
| ETT* | 336 | 0.311 / 0.358 | 0.325 / 0.366 | 0.339 / 0.372 | 1.363 / 0.887 | 1.334 / 0.872 | 1.842 / 1.212 | 2.568 / 1.238 |
| ETT* | 720 | 0.402 / 0.398 | 0.421 / 0.415 | 0.422 / 0.419 | 3.379 / 1.388 | 3.048 / 1.328 | 2.315 / 1.712 | 2.720 / 1.287 |
| Electricity | 96 | 0.187 / 0.296 | 0.193 / 0.308 | 0.201 / 0.317 | 0.274 / 0.368 | 0.258 / 0.357 | 0.252 / 0.362 | 0.375 / 0.437 |
| Electricity | 192 | 0.195 / 0.307 | 0.201 / 0.315 | 0.222 / 0.334 | 0.296 / 0.386 | 0.266 / 0.368 | 0.301 / 0.393 | 0.442 / 0.473 |
| Electricity | 336 | 0.203 / 0.325 | 0.214 / 0.329 | 0.231 / 0.338 | 0.300 / 0.394 | 0.280 / 0.380 | 0.322 / 0.403 | 0.439 / 0.473 |
| Electricity | 720 | 0.229 / 0.338 | 0.246 / 0.355 | 0.254 / 0.361 | 0.373 / 0.439 | 0.283 / 0.376 | 0.377 / 0.462 | 0.980 / 0.814 |
| Exchange | 96 | 0.127 / 0.259 | 0.148 / 0.278 | 0.197 / 0.323 | 0.847 / 0.752 | 0.968 / 0.812 | 1.365 / 0.911 | 1.453 / 1.049 |
| Exchange | 192 | 0.155 / 0.355 | 0.271 / 0.380 | 0.300 / 0.369 | 1.204 / 0.895 | 1.040 / 0.851 | 1.235 / 0.953 | 1.846 / 1.179 |
| Exchange | 336 | 0.436 / 0.469 | 0.460 / 0.500 | 0.509 / 0.524 | 1.672 / 1.036 | 1.659 / 1.081 | 1.563 / 1.123 | 2.136 / 1.231 |
| Exchange | 720 | 0.998 / 0.732 | 1.195 / 0.841 | 1.447 / 0.941 | 2.478 / 1.310 | 1.941 / 1.127 | 1.780 / 1.240 | 2.984 / 1.427 |
| Traffic | 96 | 0.562 / 0.354 | 0.587 / 0.366 | 0.613 / 0.388 | 0.719 / 0.391 | 0.684 / 0.384 | 0.632 / 0.534 | 0.843 / 0.453 |
| Traffic | 192 | 0.593 / 0.355 | 0.604 / 0.373 | 0.616 / 0.382 | 0.696 / 0.379 | 0.685 / 0.390 | 0.695 / 0.592 | 0.847 / 0.453 |
| Traffic | 336 | 0.603 / 0.377 | 0.621 / 0.383 | 0.622 / 0.337 | 0.777 / 0.420 | 0.733 / 0.408 | 0.732 / 0.620 | 0.853 / 0.455 |
| Traffic | 720 | 0.599 / 0.356 | 0.626 / 0.382 | 0.660 / 0.408 | 0.864 / 0.472 | 0.717 / 0.396 | 0.793 / 0.711 | 1.500 / 0.805 |
| Weather | 96 | 0.202 / 0.276 | 0.217 / 0.296 | 0.266 / 0.336 | 0.300 / 0.384 | 0.458 / 0.490 | 0.423 / 0.446 | 0.369 / 0.406 |
| Weather | 192 | 0.254 / 0.318 | 0.276 / 0.336 | 0.307 / 0.367 | 0.594 / 0.544 | 0.658 / 0.589 | 0.492 / 0.511 | 0.416 / 0.435 |
| Weather | 336 | 0.307 / 0.369 | 0.339 / 0.380 | 0.359 / 0.395 | 0.578 / 0.523 | 0.797 / 0.652 | 0.533 / 0.562 | 0.455 / 0.454 |
| Weather | 720 | 0.395 / 0.417 | 0.403 / 0.428 | 0.419 / 0.428 | 1.059 / 0.741 | 0.869 / 0.675 | 0.591 / 0.612 | 0.535 / 0.520 |
| ILI | 24 | 3.019 / 1.256 | 3.228 / 1.260 | 3.483 / 1.287 | 5.764 / 1.677 | 4.480 / 1.444 | 4.121 / 1.562 | 5.914 / 1.734 |
| ILI | 36 | 2.418 / 1.069 | 2.679 / 1.080 | 3.103 / 1.148 | 4.755 / 1.467 | 4.799 / 1.467 | 4.992 / 1.588 | 6.631 / 1.854 |
| ILI | 48 | 2.586 / 1.065 | 2.622 / 1.078 | 2.669 / 1.085 | 4.763 / 1.469 | 4.800 / 1.468 | 5.312 / 1.895 | 6.736 / 1.857 |
| ILI | 60 | 2.789 / 1.146 | 2.857 / 1.157 | 2.770 / 1.125 | 5.264 / 0.564 | 5.278 / 1.560 | 5.882 / 1.989 | 6.870 / 1.879 |
Table 3. Univariate results with different prediction lengths O ∈ {96, 192, 336, 720}. We set the input length I as 36 for ILI and 96 for the others. A lower MSE or MAE indicates a better prediction. Each entry is reported as MSE / MAE.

| Dataset | O | DESTformer | FEDformer [3] | Autoformer [2] | Informer [1] | LogTrans [21] | ARIMA [37] | LSTM [34] |
|---|---|---|---|---|---|---|---|---|
| ETT* | 96 | 0.058 / 0.177 | 0.072 / 0.206 | 0.065 / 0.189 | 0.080 / 0.217 | 0.075 / 0.208 | 0.211 / 0.362 | 1.921 / 0.963 |
| ETT* | 192 | 0.096 / 0.234 | 0.102 / 0.245 | 0.118 / 0.256 | 0.112 / 0.259 | 0.129 / 0.275 | 0.261 / 0.406 | 2.122 / 1.007 |
| ETT* | 336 | 0.127 / 0.258 | 0.130 / 0.279 | 0.154 / 0.305 | 0.166 / 0.314 | 0.154 / 0.302 | 0.317 / 0.448 | 2.448 / 1.146 |
| ETT* | 720 | 0.154 / 0.317 | 0.178 / 0.325 | 0.182 / 0.335 | 0.228 / 0.380 | 0.160 / 0.366 | 0.487 / 0.334 | 2.554 / 1.120 |
| Electricity | 96 | 0.244 / 0.356 | 0.253 / 0.370 | 0.341 / 0.438 | 0.258 / 0.367 | 0.288 / 0.393 | 0.367 / 0.391 | 0.382 / 0.446 |
| Electricity | 192 | 0.264 / 0.354 | 0.282 / 0.386 | 0.345 / 0.428 | 0.285 / 0.388 | 0.432 / 0.483 | 0.423 / 0.446 | 0.410 / 0.463 |
| Electricity | 336 | 0.337 / 0.429 | 0.346 / 0.431 | 0.406 / 0.470 | 0.336 / 0.423 | 0.430 / 0.483 | 0.401 / 0.422 | 0.437 / 0.495 |
| Electricity | 720 | 0.419 / 0.479 | 0.422 / 0.484 | 0.565 / 0.581 | 0.607 / 0.599 | 0.491 / 0.531 | 0.533 / 0.576 | 0.884 / 0.756 |
| Exchange | 96 | 0.117 / 0.274 | 0.154 / 0.304 | 0.241 / 0.387 | 1.327 / 0.944 | 0.237 / 0.377 | 0.118 / 0.285 | 1.563 / 0.995 |
| Exchange | 192 | 0.257 / 0.381 | 0.286 / 0.420 | 0.300 / 0.369 | 1.258 / 0.924 | 0.738 / 0.619 | 0.304 / 0.404 | 1.754 / 1.008 |
| Exchange | 336 | 0.497 / 0.450 | 0.511 / 0.555 | 0.509 / 0.524 | 2.179 / 1.296 | 2.018 / 1.070 | 0.736 / 0.598 | 2.035 / 1.003 |
| Exchange | 720 | 1.229 / 0.862 | 1.301 / 0.879 | 1.260 / 0.867 | 1.280 / 0.953 | 2.405 / 1.175 | 1.871 / 0.935 | 2.785 / 1.023 |
| Traffic | 96 | 0.198 / 0.298 | 0.207 / 0.312 | 0.246 / 0.346 | 0.257 / 0.353 | 0.226 / 0.317 | 0.422 / 0.435 | 0.756 / 0.369 |
| Traffic | 192 | 0.193 / 0.307 | 0.205 / 0.312 | 0.266 / 0.370 | 0.299 / 0.376 | 0.314 / 0.408 | 0.466 / 0.472 | 0.856 / 0.358 |
| Traffic | 336 | 0.215 / 0.319 | 0.219 / 0.323 | 0.263 / 0.371 | 0.312 / 0.387 | 0.387 / 0.453 | 0.562 / 0.518 | 0.779 / 0.430 |
| Traffic | 720 | 0.230 / 0.300 | 0.244 / 0.344 | 0.269 / 0.372 | 0.366 / 0.436 | 0.437 / 0.491 | 0.634 / 0.590 | 0.458 / 0.750 |
| Weather | 96 | 0.006 / 0.059 | 0.006 / 0.062 | 0.011 / 0.081 | 0.004 / 0.044 | 0.005 / 0.052 | 0.126 / 0.149 | 0.360 / 0.395 |
| Weather | 192 | 0.005 / 0.058 | 0.006 / 0.062 | 0.008 / 0.067 | 0.002 / 0.040 | 0.006 / 0.060 | 0.188 / 0.212 | 0.452 / 0.421 |
| Weather | 336 | 0.004 / 0.045 | 0.004 / 0.050 | 0.006 / 0.062 | 0.004 / 0.049 | 0.006 / 0.054 | 0.237 / 0.271 | 0.367 / 0.426 |
| Weather | 720 | 0.006 / 0.053 | 0.006 / 0.059 | 0.009 / 0.070 | 0.003 / 0.042 | 0.007 / 0.059 | 0.312 / 0.351 | 0.522 / 0.503 |
| ILI | 24 | 0.629 / 0.596 | 0.708 / 0.627 | 0.948 / 0.732 | 5.282 / 2.050 | 3.607 / 1.662 | 3.442 / 1.612 | 4.886 / 1.652 |
| ILI | 36 | 0.572 / 0.601 | 0.584 / 0.617 | 0.634 / 0.650 | 4.554 / 1.916 | 2.407 / 1.363 | 3.634 / 1.920 | 5.426 / 1.698 |
| ILI | 48 | 0.702 / 0.652 | 0.717 / 0.697 | 0.791 / 0.752 | 4.273 / 1.846 | 3.106 / 1.575 | 3.885 / 1.989 | 5.265 / 1.775 |
| ILI | 60 | 0.820 / 0.763 | 0.855 / 0.774 | 0.874 / 0.797 | 5.214 / 2.057 | 3.698 / 1.733 | 4.263 / 2.147 | 5.994 / 1.754 |
Table 4. Ablation study results on the ETTh1 dataset.

| Input Length I | 96 | 96 | 96 | 192 | 192 | 192 | 336 | 336 | 336 |
|---|---|---|---|---|---|---|---|---|---|
| Prediction Length O | 336 | 720 | 1440 | 336 | 720 | 1440 | 336 | 720 | 1440 |
| DESTformer MSE | 0.291 | 0.398 | 0.503 | 0.302 | 0.397 | 0.467 | 0.312 | 0.388 | 0.552 |
| DESTformer MAE | 0.355 | 0.393 | 0.488 | 0.327 | 0.402 | 0.455 | 0.366 | 0.407 | 0.533 |
| DESTformer-f MSE | 0.318 | 0.416 | 0.522 | 0.320 | 0.408 | 0.472 | 0.325 | 0.401 | 0.568 |
| DESTformer-f MAE | 0.366 | 0.401 | 0.508 | 0.335 | 0.418 | 0.472 | 0.379 | 0.425 | 0.549 |
| DESTformer-s MSE | 0.362 | 0.457 | 0.544 | 0.377 | 0.458 | 0.505 | 0.358 | 0.433 | 0.598 |
| DESTformer-s MAE | 0.398 | 0.453 | 0.532 | 0.365 | 0.452 | 0.497 | 0.346 | 0.425 | 0.587 |
| DESTformer-t MSE | 0.327 | 0.453 | 0.563 | 0.355 | 0.446 | 0.544 | 0.346 | 0.435 | 0.604 |
| DESTformer-t MAE | 0.325 | 0.447 | 0.556 | 0.346 | 0.435 | 0.537 | 0.334 | 0.424 | 0.597 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

