In this section, we present the mathematical formulation of the Bi-dual Inference framework. Sentiment classification and sentence generation are treated as dual tasks: sentiment classification labels the polarity of a given natural-language sentence, while the dual task automatically generates sentences with a specified polarity category. Consider two domains, denoted $X$ and $Y$; in this paper, $X$ is the domain of sentiment-laden sentences and $Y$ is the domain of polarity labels. We write $\mathcal{D}_X$ for the collection of training data for the $X \to Y$ direction and $\mathcal{D}_Y$ for the collection of training data for the $Y \to X$ direction; note that $\mathcal{D}_X \subseteq X$ and $\mathcal{D}_Y \subseteq Y$. Our objective is to learn two agents, $f$ and $g$, which are the models for the primal task and the dual task, respectively. We introduce the mapping $\Delta_X \colon X \times X \to \mathbb{R}$, which measures the dual reconstruction error between $x$ and its reconstruction $\hat{x}$, also known as the feedback signal; here $x, \hat{x} \in X$. Similarly, we have the mapping $\Delta_Y$, defined analogously to $\Delta_X$, with $y, \hat{y} \in Y$.
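The closed loop behind these definitions can be sketched in a few lines. The toy classify/generate functions and the token-overlap error below are illustrative assumptions, not the paper's actual models; only the shapes ($f \colon X \to Y$, $g \colon Y \to X$, $\Delta_X$ on pairs of sentences) follow the text.

```python
# Toy agents standing in for the primal classifier f and dual generator g.

def f(sentence):
    """Primal agent f: X -> Y, maps a sentence to a polarity label."""
    return "positive" if "good" in sentence else "negative"

def g(label):
    """Dual agent g: Y -> X, maps a polarity label to a sentence."""
    return "this movie is good" if label == "positive" else "this movie is bad"

def delta_x(x, x_hat):
    """Dual reconstruction error Delta_X between x and its reconstruction.
    Jaccard distance over tokens is an illustrative stand-in."""
    a, b = set(x.split()), set(x_hat.split())
    return 1.0 - len(a & b) / len(a | b)

x = "this movie is good"
x_hat = g(f(x))               # closed loop X -> Y -> X
feedback = delta_x(x, x_hat)  # feedback signal used during training
```

When the loop reconstructs the input exactly, the feedback signal is zero; any mismatch produces a positive error that can drive training.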
3.1. Primal Model Construction
To build the primal-task model $f$, we adopt a classical LSTM-based emotion classification method [39], trained with supervised learning in the sequence-to-sequence framework. An LSTM encoder processes the input sequence word by word according to Equations (2)–(6), producing a fixed-dimensional vector representation. A second LSTM then decodes the output sequence from that vector; this decoder acts as a recurrent neural network language model conditioned on the input sequence. The modified sequence-to-sequence model, depicted in Figure 1, is trained to reconstruct the input sequence itself, i.e., the target output sequence is set to the input sequence. In this sequence autoencoder, a recurrent network reads the input sequence, produces a hidden state, and reconstructs the original sequence from it. Notably, the decoder and encoder networks share the same weights, as illustrated in Figure 1.
In the LSTM setting, our objective is to estimate the conditional probability $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$, where $(x_1, \dots, x_T)$ is the input sequence and $(y_1, \dots, y_{T'})$ is the corresponding output sequence, whose length $T'$ may differ from $T$. The LSTM achieves this by first obtaining a fixed-dimensional representation $v$ of the input sequence $(x_1, \dots, x_T)$, taken from its last hidden state. It then computes the probability of $(y_1, \dots, y_{T'})$ with a standard LSTM language-model formulation whose initial hidden state is set to the representation $v$:

$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$.   (1)
In Equation (1), each conditional probability $p(y_t \mid v, y_1, \dots, y_{t-1})$ is represented by a softmax over all words in the vocabulary. The recurrent network of the sequence autoencoder processes the input sequence to produce a hidden state, which is then used to reconstruct the original sequence. In other words, combining Equation (1) with the basic LSTM formulas (Equations (2)–(6)), the LSTM first reads the input sentence "W", "X", "Y", "Z", "eos" and computes its vector representation. Here, "eos" is a special symbol denoting the end of the sentence, which allows the model to define distributions over sentences of varying lengths. The obtained vector representation is then used to compute the probabilities of "W", "X", "Y", "Z", and "eos" in turn, generating a new sentence that reconstructs the original one; the generated sentence is read into the hidden state as the input sequence for the next step. Notably, the LSTM reads the input sentence in reverse order, which introduces many short-term dependencies and thereby simplifies the optimization problem.
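The chain-rule factorization of Equation (1) can be sketched directly. The tiny "decoder" below (a fixed random matrix acting on a running state) is an illustrative stand-in for the second LSTM, not the paper's model; only the structure — a per-step softmax over the vocabulary, conditioned on the encoder summary $v$ and the tokens emitted so far — follows the text.

```python
import math
import random

random.seed(0)
V, H = 5, 8                                   # vocabulary size, state size
W = [[random.gauss(0, 1) for _ in range(H)] for _ in range(V)]  # output weights
E = [[random.gauss(0, 1) for _ in range(H)] for _ in range(V)]  # token embeddings

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [x / s for x in e]

def sequence_log_prob(v, y):
    """log p(y | v): the decoder state starts from the encoder summary v."""
    h, logp = list(v), 0.0
    for tok in y:
        logits = [sum(w * x for w, x in zip(row, h)) for row in W]
        p = softmax(logits)          # distribution over the whole vocabulary
        logp += math.log(p[tok])     # one chain-rule factor p(y_t | v, y_<t)
        h = [math.tanh(a + b) for a, b in zip(h, E[tok])]  # fold token into state
    return logp

v = [random.gauss(0, 1) for _ in range(H)]    # fixed-dimensional representation
lp = sequence_log_prob(v, [2, 0, 4])
```

Because each factor is a softmax, the probabilities of all length-one continuations sum to one, which is what makes the product in Equation (1) a proper distribution over sequences.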
In Equations (2)–(6), $\sigma$ denotes the logistic sigmoid function, and $i_t$, $f_t$, $o_t$, and $c_t$ denote the input gate, forget gate, output gate, and cell activation vectors, respectively, all of the same size as the hidden vector $h_t$. The weight-matrix subscripts indicate their roles: for example, $W_{hi}$ is the hidden–input gate matrix and $W_{xo}$ is the input–output gate matrix. The weight matrices connecting the cell to the gate vectors (e.g., $W_{ci}$) are diagonal, so each element of a gate vector receives input only from the corresponding element of the cell vector.
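For completeness, a standard formulation consistent with this description — a Graves-style LSTM with diagonal peephole (cell-to-gate) connections — reads as follows; this is a reconstruction matching the gate names above, not a verbatim copy of the paper's Equations (2)–(6):

```latex
\begin{align}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Here $\odot$ is element-wise multiplication, and $W_{ci}$, $W_{cf}$, $W_{co}$ are the diagonal cell-to-gate matrices mentioned above.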
3.3. Bi-Dual Inference Construction
In this subsection, we explain how the two models above (the primal model and the dual model) are trained jointly. We first establish a standard dual learning loss function [41] for a single primal model $f$ and a single dual model $g$, as given in Equation (9). Here $\mathcal{D}_X$ and $\mathcal{D}_Y$ denote the data collected from domains $X$ and $Y$, respectively; $X$ contains sentiment-laden sentence data, while $Y$ contains sentiment polarity labels, with $\mathcal{D}_X \subseteq X$ and $\mathcal{D}_Y \subseteq Y$. The quantities $n_X$ and $n_Y$ denote the numbers of samples in $\mathcal{D}_X$ and $\mathcal{D}_Y$, respectively.
Taking inspiration from dual inference [37]: whereas standard inference for the primal or the dual model relies only on its own most natural, direct scoring rule, we can instead combine the loss functions of the primal and dual tasks and take as the inference result the output that minimizes the combined loss. Specifically, the inference rules for the primal task and the dual task are given by Equations (10) and (11), respectively.
Here, the hyperparameters $\alpha$ and $\beta$ balance the trade-off between the two losses; their values are tuned on the validation set. The loss functions of the single primal model and the single dual model are denoted $\ell_f$ and $\ell_g$, respectively. Since sentiment analysis is treated as a multi-class problem in this paper, we use the negative log-likelihood as the loss function, as shown in Equations (12) and (13).
Because the loss functions $\ell_f$ and $\ell_g$ are negative log-likelihoods, their values range from zero to infinity; the closer they are to zero, the better the trained primal and dual models fit the data. Many commonly used inference rules in machine learning can be described by Equations (14) and (15): setting $\alpha$ or $\beta$ to one recovers them as extreme cases of dual inference. From this perspective, dual inference is the more general inference framework.
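The combined inference rule can be sketched concretely. The probability tables below are made-up numbers, not trained models; only the scoring rule — a weighted sum of the primal loss $-\log p(y \mid x; f)$ and the dual loss $-\log p(x \mid y; g)$, minimized over candidate labels, in the spirit of Equations (10) and (11) — follows the text.

```python
import math

ALPHA, BETA = 0.7, 0.3          # trade-off hyper-parameters

# Illustrative placeholder probabilities for one fixed input sentence x:
p_y_given_x = {"positive": 0.55, "negative": 0.45}   # primal model f
p_x_given_y = {"positive": 0.05, "negative": 0.20}   # dual model g

def dual_infer(labels):
    """Pick the label minimizing the combined primal + dual loss."""
    def score(y):
        return -ALPHA * math.log(p_y_given_x[y]) - BETA * math.log(p_x_given_y[y])
    return min(labels, key=score)

label = dual_infer(["positive", "negative"])
```

In this toy case the dual model's evidence flips a near-tie: $f$ slightly prefers "positive", but $g$ strongly favours "negative", so the combined score selects "negative". With $\alpha = 1$ and $\beta = 0$ the rule degenerates to standard single-model inference, matching the extreme cases above.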
It is important to note that the dual inference approach studied by Xia et al. [37] involves no retraining and no modification of the models for the primal and dual tasks. In this paper, by contrast, the presence of multiple $f$ and $g$ requires multiple training processes. To address this, we use the probabilistic duality method [36] to train the multiple $f$ and $g$; the core algorithm is given in Equations (16) and (17). Here, $m$ denotes the total number of training sample pairs $(x, y)$. The Lagrange parameters $\lambda_{xy}$ and $\lambda_{yx}$, together with the parameters $\theta_{xy}$ and $\theta_{yx}$, are trained in the primal model and the dual model, respectively.
If we regard the $f$ and $g$ in Equations (10) and (11) as $f_1$ and $g_1$, respectively, then applying different training processes yields multiple $f$ and $g$ models. In this paper, we denote these models as $f_i$, where $1 \le i \le N$, and $g_i$, where $1 \le i \le N$. Consequently, we update Equations (12) and (13) to Equations (18) and (19), as shown below.
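With multiple primal models $f_1, \dots, f_N$, the single-model negative log-likelihood is replaced by an aggregate over the ensemble; an average is assumed in the sketch below (the paper's exact aggregation in Equations (18) and (19) may differ), and the probabilities are illustrative placeholders.

```python
import math

def ensemble_nll(probs_per_model):
    """Average negative log-likelihood the ensemble assigns to the gold label."""
    return sum(-math.log(p) for p in probs_per_model) / len(probs_per_model)

# Three independently trained f_i, each giving its own p(y_gold | x):
loss_F = ensemble_nll([0.9, 0.7, 0.8])
```

The same construction applies symmetrically to the dual models $g_i$; a confident, unanimous ensemble drives the aggregate loss toward zero.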
To further explore the potential of dual inference, we introduce multiple dual inference models into the learning system. In a dual structure, all models in the same direction map the space $X$ to the space $Y$ (or vice versa), so they are similar in function yet differ in behaviour. In this study, we generate multiple pairs of distinct $(f_i, g_i)$ models by independent training with different random seeds, i.e., different initializations and data-access orders. The output of each proxy model provides feedback signals to $f$ or $g$, so the models receive additional gains during training. Having multiple proxy models generally yields more reliable, robust, and comprehensive feedback, much like a majority vote of multiple experts, and is expected to improve the final model performance. Therefore, for any $i$ with $1 \le i \le N$, we define Equations (20) and (21).
$F$ and $G$ denote the sets of multiple primal models and multiple dual models, respectively, and the following equations hold under the constraint $1 \le i \le N$.
Figure 3 illustrates the overall framework of our proposed model. The robots, representing the $f_i$ or the $g_i$, symbolize models in the same direction, which are similar yet not identical in function; the different colours indicate that they are trained independently with different random seeds, initializations, and data-access orders. The $\Delta_X$ and $\Delta_Y$ in the middle of Figure 3 represent the mutual feedback between the two sides: by comparing the difference between a generated $\hat{x}$ and $x$, or between $\hat{y}$ and $y$, in a closed-loop system, feedback signals are provided to the models that enhance their joint training. $F$ and $G$ denote the sets of multiple primal models and multiple dual models, respectively. The primal task is sentiment classification of a given text, while the dual task is sentence generation from polarity labels.
To provide a more detailed explanation, we can divide
Figure 3 into two separate framework diagrams, as shown in
Figure 4 and
Figure 5. In both diagrams, robots with the same shape represent proxy models in the same direction. Multiple proxy models collaborate to reason about a task simultaneously. This arrangement enables the dual tasks to receive feedback from multiple agent models, leading to improved results.
For any $x \in \mathcal{D}_X$, the proxies $f_i$ collaborate to generate a corresponding $\hat{y}$, and the proxies $g_i$ then work together to reconstruct the original input as $\hat{x}$. The reconstruction error between $x$ and $\hat{x}$ is also taken into account. The resulting dual learning loss is defined by Equation (22).
$n_X$ denotes the number of samples in the dataset $\mathcal{D}_X$, and $n_Y$ the number of samples in $\mathcal{D}_Y$; $F$ denotes the set of multiple primal models, and $G$ the set of multiple dual models. The error $\Delta_X(x, \hat{x})$ measures the discrepancy between the original input $x$ and the reconstruction $\hat{x}$ obtained through the multiple models; it serves as the feedback error used to update the optimization. Similarly, $\Delta_Y(y, \hat{y})$ is the feedback error between the original output $y$ and the reconstruction $\hat{y}$.
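The bi-dual loss can be sketched end to end. The majority-vote aggregation, the toy models, and the 0/1 error below are all illustrative assumptions; only the structure — both closed loops, with $\Delta_X$ and $\Delta_Y$ feedback averaged over the datasets, in the spirit of Equation (22) — follows the text.

```python
from collections import Counter

# Toy ensembles: three identical stand-ins for the sets F and G.
F = [lambda x: "positive" if "good" in x else "negative"] * 3  # primal set
G = [lambda y: "good" if y == "positive" else "bad"] * 3       # dual set

def vote(models, inp):
    """Majority vote of the ensemble, like a vote of multiple experts."""
    return Counter(m(inp) for m in models).most_common(1)[0][0]

def delta(a, b):
    return 0.0 if a == b else 1.0   # 0/1 stand-in for Delta_X / Delta_Y

def bidual_loss(xs, ys):
    loss = 0.0
    for x in xs:                       # X -> Y -> X closed loop
        loss += delta(x, vote(G, vote(F, x))) / len(xs)
    for y in ys:                       # Y -> X -> Y closed loop
        loss += delta(y, vote(F, vote(G, y))) / len(ys)
    return loss

loss = bidual_loss(["good", "bad"], ["positive", "negative"])
```

When both loops reconstruct their inputs exactly, the loss is zero; each failed round trip contributes a positive feedback error.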
To enhance the model's performance, we incorporate $N$ model pairs $(f_i, g_i)$ for joint training; when $N = 1$, the model reduces to a standard dual inference learning model. In summary, we implement the model with the Bi-dual inference algorithm (Algorithm 1). The algorithm first computes multiple $f$ and $g$ using Equations (10) and (11); during this computation, the Lagrange parameters $\lambda_{xy}$ and $\lambda_{yx}$, as well as the trained and optimized parameters $\theta_{xy}$ and $\theta_{yx}$, are adjusted. After the first for-loop completes, the resulting multiple $f$ and $g$ are trained with Equations (20) and (21), yielding the set $F$ of multiple primal models and the set $G$ of multiple dual models. Finally, all parameters are fed into Equation (22) to compute the loss function of the model.
Algorithm 1 Bi-dual inference algorithm
Input:
1: Data $\mathcal{D}_X$ and $\mathcal{D}_Y$; optimizers; hyper-parameters $\alpha$, $\beta$; beam search size $K$; Lagrange parameters $\lambda_{xy}$ and $\lambda_{yx}$
2: $\mathcal{D}_X$: a set of short texts with emotion labels
3: $\mathcal{D}_Y$: a set of emotion labels
Output:
4: Loss function of the model
5: repeat
6:   Training $f$ and $g$
7:   Iteration number accumulation
8:   Get a random minibatch of $m$ pairs
9:   Calculate the model gradients according to Equations (16) and (17)
10:  Update the parameters of $f$ and $g$ and return to get $f_i$, $g_i$
11:
12:  Training $F$ and $G$
13:  for $i = 1, \dots, N$ do
14:   Initialise the loss functions according to Equations (18) and (19), then calculate $f_i$ and $g_i$ according to Equations (10) and (11)
15:
16:
17:   Following Equations (20) and (21), calculate $F$ and $G$
18:   Update the parameters and return the value of the model
19:
20:  end for
21: until function convergence