Article

Knowledge Distillation-Based Multilingual Code Retrieval

College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China
* Author to whom correspondence should be addressed.
Algorithms 2022, 15(1), 25; https://doi.org/10.3390/a15010025
Submission received: 30 November 2021 / Revised: 7 January 2022 / Accepted: 12 January 2022 / Published: 17 January 2022

Abstract

Semantic code retrieval is the task of retrieving relevant code based on natural language queries. Although it is related to other information retrieval tasks, it must bridge the gap between the language used in code (which is usually syntax- and logic-specific) and natural language, which is better suited to describing ambiguous concepts and ideas. Existing approaches study code retrieval from a natural language to one specific programming language; in multilingual scenarios this is unwieldy and typically requires a large corpus for each language. Using knowledge distillation from six existing monolingual Teacher Models to train one Student Model, MPLCS (Multi-Programming Language Code Search), this paper proposes a method that supports multi-programming-language code search. MPLCS incorporates multiple languages into one model with low corpus requirements, can learn the commonality between different programming languages, and improves retrieval accuracy for languages with small datasets: for Ruby, the language used in this paper, MPLCS improves the MRR score by 20 to 25%. In addition, MPLCS compensates for the low accuracy of monolingual models when they are used to retrieve code in other programming languages, and in some cases its accuracy even outperforms that of monolingual models on their own language.

1. Introduction

Research on code retrieval can be divided into two broad categories according to the methods used: information retrieval-based methods and deep learning model-based methods. Information retrieval-based methods build on traditional search techniques; the main idea is to improve on text-similarity search and to perform code retrieval by combining search techniques with code features. For example, Luan et al. [1] proposed a structured code recommendation tool called Aroma, which searches for code using code itself as the query. They divide the retrieval process into two stages: stage 1 performs a lightweight search over a small range, and stage 2 performs a further in-depth search based on the results of stage 1. Building on this line of work, Lv et al. [2] found a way to better capture the characteristics of code by focusing on the API calls it makes: standard APIs have specific functions and corresponding documentation descriptions, which lets them turn the code search task into a simple similarity match between natural language descriptions and API documentation. However, these information retrieval-based methods cannot uncover the deep connection between natural language and code, which limits their accuracy. With the rapid development of NLP, some scholars have started using deep learning models to solve the code retrieval problem and improve its accuracy.
The main strategy of the deep learning-based approach is to use neural networks to map code snippets and query statements into the same vector space. Husain et al. [3] proposed a plain and typical baseline framework in which the code is simply treated as text, encoded with several common text-embedding methods, and mapped together with the description into the same high-dimensional space for learning. Not long after, Gu et al. [4] went further in this direction by considering not only the code text but also function names and API (Application Programming Interface) sequences as features. Since then, many scholars have exploited more code-specific features to build new models. Haldar et al. [5] added both token and AST (Abstract Syntax Tree) information and improved the joint embedding; they argue that computing similarity only in the last step (i.e., considering only overall similarity and not local similarity) causes information loss, so they fused the CAT (an AST-based model) and MP (a multi-perspective model) methods and proposed the MP-CAT model. Sachdev et al. [6] addressed an unsupervised learning task by proposing Neural Code Search (NCS), which uses features such as function names, function calls, enumerated quantities, string literals, annotations, TF-IDF weights, and word vectors to construct high-dimensional vectors for retrieval. Cambronero et al. [7] made improvements on the NCS algorithm and proposed UNIF, adding a supervised extension to the unsupervised NCS algorithm to improve model performance; with a small number of supervised samples, this model is comparable to some supervised learning algorithms. Meanwhile, some scholars are researching other issues related to code retrieval. For example, Yin and Neubig [8] investigated the task of code generation and argued that viewing it as a seq2seq generation task ignores the specific syntactic structure of the code language; they therefore propose generating an AST tree from natural language and using tools to convert the AST tree into code. Analogously to large pre-trained models such as ELMo, GPT, BERT, and XLNet, Feng et al. [9] and Kanade et al. [10] proposed pre-trained models trained on code.
When modern software engineers develop a software product, more than one programming language is often required, so developers face the need to search across multiple code languages during development. A recent survey of open-source projects has shown that the use of multiple languages is rather universal, with a mean of 5 languages used per project [11]. Thus, multi-language software development (MLSD) seems to be common, at least in the open-source world [12]. As mentioned above, both the information retrieval-based approaches and the deep learning-based approaches deal with the mapping from a single natural language to a single programming language. This paper therefore proposes mapping a single natural language to multiple programming languages, which is new in this field. The goal is that, given a natural language query, we can find code in multiple programming languages (such as Java, PHP, Go, etc.) whose functionality matches the natural language description. This is done using knowledge distillation.
Hinton et al. [13] proposed the concept of knowledge distillation. Its core idea is to first train a complex model (known as the Teacher Model) and then use the output or intermediate state of this model to train a smaller model (known as the Student Model). The main contribution of knowledge distillation is model compression, which has been widely studied and applied in many areas of deep learning, such as natural language processing, speech recognition, and computer vision. The technique is also used in natural language translation: many scholars have done exploratory work on multilingual translation models [14,15,16,17], and NMT-based multilingual translation models have been developed. Tan et al. [18] proposed transferring knowledge from individual models to a multilingual model using knowledge distillation; this setup, commonly used for model compression and knowledge transfer, fits the multilingual code search setting quite well. A large and deep Teacher Model (or an ensemble of multiple models) is usually trained first, and then a smaller and shallower Student Model is trained to mimic the behavior of the Teacher Model. Through knowledge distillation, the Student Model can approach or even outperform the accuracy of the complex Teacher Model. This paper uses knowledge distillation to fuse six pre-trained monolingual models into one student model. By doing so, the size of the model is reduced significantly, while the student model's performance on the test set is almost the same as that of each teacher model; for Ruby and JavaScript, the student model even outperforms the teacher model. After repeating the experiments with different code and query encoders, we confirm that this method can be used with a wide range of encoders.
We summarize our contributions as follows:
  • We propose a code search model that efficiently and accurately addresses multi-programming language fusion. A single model can solve the problem of searching for multiple programming languages.
  • Compared to multiple models, our model has fewer parameters. Also, the data set requirements are lower because the data sets are complementary between different languages.
  • The ability to uncover connections between different programming languages makes the model highly extensible, and this provides some support for languages with relatively small corpora.
Background
Joint Vector Representations, also known as Multimodal Embeddings [19], are very common in code retrieval tasks, and most deep-learning-based methods use them. Joint vector representation is a method for learning the connection between two heterogeneous structures: data of two different structures are mapped into the same high-dimensional space [20] so that corresponding data points fall as close to each other as possible while non-corresponding data points are pushed as far apart as possible. Such an approach also makes the query process intuitive: when performing a search, one only needs to find the points in the high-dimensional space closest to the target point, i.e., solve the nearest-neighbor problem in the high-dimensional space.
This paper used joint embedding to learn the relevance between natural language description and code. As shown in Figure 1, code segments and natural language representations are mapped to the same high-dimensional space. The code for bubble sort and “bubble sort” is mapped to relatively close locations, as is the case for “quick sort”.
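To make the retrieval step concrete, the following minimal sketch (our own illustration, not code from the paper) performs the nearest-neighbor lookup in the shared embedding space: a query vector is compared against pre-computed code vectors with cosine similarity and the closest snippets are returned. The array shapes and the 128-dimensional embedding size are assumptions borrowed from the model setup described in Section 2.

```python
import numpy as np

def retrieve_top_k(query_vec, code_vecs, k=5):
    """Return the indices of the k code vectors closest to the query under cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of every code snippet to the query
    return np.argsort(-scores)[:k]      # nearest neighbors first

# Toy usage: 1000 pre-computed 128-d code embeddings and one query embedding.
rng = np.random.default_rng(0)
code_vecs = rng.normal(size=(1000, 128))
query_vec = rng.normal(size=128)
print(retrieve_top_k(query_vec, code_vecs))
```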
Paper structure
The remainder of the paper is organized as follows. Section 2 presents our proposed framework, including the teacher model network structure and the process of using knowledge distillation from the teacher models to train the student model. Section 3 describes the experimental setup and details. Section 4 presents an analysis of the experimental results. Section 5 concludes the paper.

2. Multi-Programming Language Code Search

Inspired by multilingual translation models, we propose a novel deep learning model, MPLCS, to solve the task of multi-programming language code search. Each programming language has its own syntactic structure, so programming languages are heterogeneous from one another. To handle this heterogeneity, we use the joint embedding method mentioned above to map them into the same high-dimensional space, where the semantic relatedness of heterogeneous data is measured by vector similarity. To fuse multiple models, we use knowledge distillation.

2.1. Overview

There are two main parts: one trains the Teacher Models and the other trains the Student Model. We discuss the network structure of the Teacher Model in Section 2.2. The Student Model has the same structure as the Teacher Model; it takes the corresponding Teacher Models as input and fuses the properties of the different language models. The components are elaborated in Section 2.3.

2.2. Teacher Model

We follow Husain et al.'s study [3] and use the same monolingual model structures, i.e., NBoW, 1D-CNN, and self-attention, which are commonly used token-sequence-based methods. In this paper we only introduce the model structure of self-attention, and the subsequent experiments are based on this model, since it performed best among the three methods. We embed the code in the same way as the natural language: an encoding of the token sequence with an added attention vector. In general, the model needs to train the following parts, as shown in Figure 2: the code_vocab and query_vocab embedding layers for the code and the description, the fully connected layers for the code and the description, and the attention vector.
Here, we define two embedding matrices, code_vocab and query_vocab, in which each row is the embedding of a specific code token or description token:
$\mathrm{code\_vocab} \in \mathbb{R}^{|X| \times d}$
$\mathrm{query\_vocab} \in \mathbb{R}^{|Q| \times d}$
Here X is the code token vocabulary, Q is the natural language description vocabulary, and d is the embedding dimension, which we set to 128 in our experiments. Finding the embedding of a code or query token is then simply a matter of looking up the corresponding row.
For a line of code $C = \{c\_token_1, c\_token_2, \cdots, c\_token_n\}$ and its corresponding natural language description $Q = \{q\_token_1, q\_token_2, \cdots, q\_token_m\}$, where $c\_token_i$ ($i = 1, \cdots, n$) are code tokens and $q\_token_i$ ($i = 1, \cdots, m$) are description tokens, embedding (with random initialization) gives:
$c_i = \mathrm{embedding}(c\_token_i), \quad q_i = \mathrm{embedding}(q\_token_i) \in \mathbb{R}^{d}$
After passing through a fully connected layer, we have:
$\tilde{c}_i = \tanh(W_c \cdot c_i), \quad \tilde{q}_i = \tanh(W_q \cdot q_i)$
where $W_c, W_q \in \mathbb{R}^{d \times d}$ and $\tanh$ is the hyperbolic tangent, a common monotonic nonlinear activation function whose output lies in $(-1, 1)$.
Finally, we use the attention vector to aggregate these token vectors; this is essentially a weighted average. The main step is to compute the weight of each token in the current code block or natural language sentence. Naturally, we want the tokens that best represent the code or sentence to receive larger weights, and this is where "attention" comes into play. The attention vector $a \in \mathbb{R}^{d}$ is randomly initialized and learned jointly with the rest of the model during training. The weight of each token is computed by taking the dot product of the token vector with the attention vector and then normalizing so that the weights sum to 1. The weight of each code token vector $\tilde{c}_i$ is calculated as follows:
$\alpha_i = \dfrac{\exp(\tilde{c}_i^{T} \cdot a)}{\sum_{j=1}^{n} \exp(\tilde{c}_j^{T} \cdot a)}$
Using exp ensures that the weights are positive, as in the standard softmax function, and dividing by the sum of all terms ensures that the weights sum to 1; the natural language description token vectors are handled in the same way. Once $\alpha_i$ ($i = 1, \ldots, n$) is computed, the final code vector is obtained as the weighted sum of the code token vectors $\{\tilde{c}_1, \tilde{c}_2, \ldots, \tilde{c}_n\}$. The code vector represents the entire code segment and is expressed as follows:
$v_c = \sum_{i=1}^{n} \alpha_i \cdot \tilde{c}_i$
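The following PyTorch sketch illustrates the encoder just described: token embedding, a tanh-activated fully connected layer, and attention-weighted pooling into a single vector. It is a simplified rendering of the equations above rather than the authors' implementation; the class and argument names are our own, and padding-mask handling is omitted for brevity.

```python
import torch
import torch.nn as nn

class AttentionPoolEncoder(nn.Module):
    """Token encoder sketch: embedding -> tanh projection -> attention-weighted pooling."""
    def __init__(self, vocab_size, d=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)  # rows of code_vocab / query_vocab
        self.proj = nn.Linear(d, d, bias=False)       # W_c or W_q
        self.attn = nn.Parameter(torch.randn(d))      # attention vector a

    def forward(self, token_ids):                              # token_ids: (batch, seq_len)
        h = torch.tanh(self.proj(self.embedding(token_ids)))   # (batch, seq_len, d)
        alpha = torch.softmax(h @ self.attn, dim=1)            # per-token weights alpha_i
        return (alpha.unsqueeze(-1) * h).sum(dim=1)            # pooled vector v_c or v_q

# Toy usage: encode a batch of two token-id sequences of length 5.
enc = AttentionPoolEncoder(vocab_size=30000)
print(enc(torch.randint(0, 30000, (2, 5))).shape)   # torch.Size([2, 128])
```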
Similarity Model
After obtaining the code vector $v_c$ and the description vector $v_q$, we want the code and description vectors to be jointly embedded, so we measure the similarity between these two vectors. We use cosine similarity, defined as
$\cos(v_c, v_q) = \dfrac{v_c^{T} \cdot v_q}{\|v_c\| \, \|v_q\|}$
The higher the similarity, the higher the correlation between the code vector and the description vector. In summary, the MPLCS model takes a pair of ⟨codes, descriptions⟩ as input, and calculates their cosine similarity to measure the strength of their correlation.
Teacher Model Learning
Contrastive representation learning is often used for code retrieval tasks [4,5,21,22,23], and our experiments use a contrastive loss function as the loss function of the model. We now describe how to train the MPLCS model in two stages; the first stage trains the six Teacher Models. Both codes and descriptions are embedded into a unified vector space. The ultimate goal of joint embedding is that if a code fragment and a description have similar semantics, their embedding vectors should be close to each other. In other words, given an arbitrary code fragment C and an arbitrary description D, we want the model to predict high similarity (close to 1) if D is the correct description of C, and low similarity (close to 0) otherwise.
Therefore, we need negative samples to construct our loss function. We construct each training instance as a triple $\langle C, D^{+}, D^{-} \rangle$: for each code fragment C, there is a positive description $D^{+}$ (the correct description of C) and a negative description $D^{-}$ (a wrong description of C) chosen randomly from the pool of all positive descriptions (negative samples are derived from positive samples). When trained on the set of $\langle C, D^{+}, D^{-} \rangle$ triples, MPLCS predicts the cosine similarities of the $\langle C, D^{+} \rangle$ and $\langle C, D^{-} \rangle$ pairs and minimizes the ranking loss, that is, minimizes the following equation.
$Loss_{teacher} = \max(0,\, 1 - \cos(c, d^{+}) + \cos(c, d^{-}))$
In this formula, $d^{+}$ represents the positive description and $d^{-}$ the negative description. In practice, we compute the pairwise cosine similarities between the N code vectors and the N corresponding description vectors in a batch. This gives an $N \times N$ matrix with positive-sample values on the main diagonal and negative-sample values everywhere else. We want the positive values to be as large as possible and the negative values to be as small as possible, so we subtract each diagonal value from 1; the overall goal is then to make all resulting values as small as possible. The formula is as follows:
$\mathrm{submax}(X, i) = \max_{j \neq i,\; 1 \le j \le n} X[i, j]$
$Loss(X) = \dfrac{1}{n} \sum_{i=1}^{n} \big( (1 - X[i, i]) + \mathrm{submax}(X, i) \big)$
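A minimal PyTorch sketch of this in-batch loss is given below, assuming code_vecs and desc_vecs are the encoder outputs for a batch of N aligned ⟨code, description⟩ pairs; the function name is our own.

```python
import torch
import torch.nn.functional as F

def in_batch_rank_loss(code_vecs, desc_vecs):
    """Sketch of Loss(X): diagonal entries of the N x N cosine matrix are positives, off-diagonal are negatives."""
    c = F.normalize(code_vecs, dim=1)
    d = F.normalize(desc_vecs, dim=1)
    sim = c @ d.t()                                                   # X[i, j] = cos(code_i, desc_j)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    submax = sim.masked_fill(mask, float('-inf')).max(dim=1).values   # submax(X, i)
    return ((1.0 - sim.diag()) + submax).mean()
```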

2.3. Student Model

After training the six Teacher Models, we begin to train the Student Model. Every language input will obtain two sets of vectors during the encoding step, one set constructed by the Student Model and the other one constructed by the Teacher Model for the corresponding language, as shown in Figure 3. The composition of the loss function consists of three parts: (1) STUDENT’s code vector and STUDENT’s description vector, (2) the code vector of the TEACHER and the description vector of the STUDENT, (3) the code vector of the STUDENT and the description vector of the TEACHER.
$Loss_{student\_self} = \max(0,\, 1 - \cos(c_{student}, d_{student}^{+}) + \cos(c_{student}, d_{student}^{-}))$
$Loss_{KD} = \sum_{teacher_i \in Teacher} \big[ \max(0,\, 1 - \cos(c_{teacher}, d_{student}^{+}) + \cos(c_{teacher}, d_{student}^{-})) + \max(0,\, 1 - \cos(c_{student}, d_{teacher}^{+}) + \cos(c_{student}, d_{teacher}^{-})) \big]$
$Loss_{student\_ALL} = (1 - \lambda)\, Loss_{student\_self} + \lambda\, Loss_{KD}$
In the formula, $Teacher$ is the set of Teacher Models, which in this paper contains six models. For each Teacher Model, two additional loss terms are computed, as shown in $Loss_{KD}$: one replaces the code vector with the corresponding Teacher Model's code vector, and the other replaces the description vector with the Teacher Model's description vector. The parameter $\lambda$ adjusts the contribution of the teacher models to the student model; we explore the effect of this parameter in our experiments. The two parts of $Loss_{KD}$ could also be weighted differently to emphasize one of them, which we leave for future work.
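The sketch below (our own, written for a single teacher model) shows how these loss components could be combined; in practice $Loss_{KD}$ sums the two cross terms over all six teachers, and the vectors are assumed to be L2-normalized so that a dot product equals the cosine similarity.

```python
import torch

def hinge(c, d_pos, d_neg):
    """Pairwise ranking loss max(0, 1 - cos(c, d+) + cos(c, d-)); vectors assumed L2-normalized."""
    return torch.clamp(1.0 - (c * d_pos).sum(-1) + (c * d_neg).sum(-1), min=0.0).mean()

def student_total_loss(c_s, d_s_pos, d_s_neg, c_t, d_t_pos, d_t_neg, lam=0.8):
    """Sketch of Loss_student_ALL for a single teacher model."""
    loss_self = hinge(c_s, d_s_pos, d_s_neg)                               # student code vs. student descriptions
    loss_kd = hinge(c_t, d_s_pos, d_s_neg) + hinge(c_s, d_t_pos, d_t_neg)  # teacher/student cross terms
    return (1.0 - lam) * loss_self + lam * loss_kd
```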
The training process is shown in Algorithm 1. L is the number of languages, which is 6 in this paper; $l \in [L]$ denotes the language index, $D_l$ denotes the training set for language $l$, $\theta_M$ represents the parameters of the multilingual Student Model, and $\theta_{teacher}^{l}$ represents the parameters of the Teacher Model for language $l$. The algorithm takes the pre-trained Teacher Models as input. Note that the training set for a Teacher Model can either be shared with the Student training set for that language or be a separate set; similarly, the structure of the student network can be the same as or different from that of the teacher network. For convenience, the same datasets and the same network structure are used in this paper. Lines 7–10 of Algorithm 1 select the loss function, based on the following strategy: if the Student Model already performs better than the Teacher Model for a particular language, the Teacher Model is not introduced. This setting is not fixed, because the accuracy for that language may later drop while other languages are being trained, in which case the Teacher Model is reintroduced. Whether or not to introduce a Teacher Model depends on the threshold $\tau$, as described in lines 14–22: once the accuracy of the Student Model reaches the accuracy of the Teacher Model plus $\tau$, the Teacher Model is no longer introduced.
Algorithm 1. Knowledge distillation in multiple code languages
Input: training set $\{D_l\}_{l=1}^{L}$, trained Teacher Models for the $L$ languages $\{\theta_{teacher}^{l}\}_{l=1}^{L}$, learning rate $\eta$, total number of training steps $T_{total}$, distillation inspection interval $T_{check}$, distillation accuracy threshold $\tau$
Output: the trained Student Model
  1: Randomly initialize the Student Model parameters $\theta_M$; set the current step count $T = 0$ and the accumulated gradient $g = 0$; for each Teacher Model, set the flag $f_l = True$, $l \in [L]$
  2: while $T < T_{total}$ do
  3:      $T = T + 1$
  4:      $g = 0$
  5:    for $l \in [L]$ do
  6:        Randomly select a batch of data $(c_l, d_l)$ from the training set $D_l$
  7:        if $f_l == True$ then
  8:           Accumulate the gradient of the full loss, $g \mathrel{+}= \partial Loss_{student\_ALL} / \partial \theta_M$
  9:        else
  10:           Accumulate the gradient of the self loss, $g \mathrel{+}= \partial Loss_{student\_self} / \partial \theta_M$
  11:        end if
  12:    end for
  13:    Update the model parameters: $\theta_M = \theta_M - \eta \cdot g$
  14:    if $T \,\%\, T_{check} == 0$ then
  15:        for $l \in [L]$ do
  16:           if $Accuracy(\theta_M) < Accuracy(\theta_{teacher}^{l}) + \tau$ then
  17:                $f_l = True$
  18:           else
  19:                $f_l = False$
  20:           end if
  21:        end for
  22:    end if
  23: end while
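For concreteness, here is a compact Python/PyTorch sketch of Algorithm 1. The interfaces are assumptions, not APIs from the paper: student and teachers[lang] are modules exposing an encode(code_ids, desc_ids) method that returns the two pooled vectors, loaders[lang] is an iterator over batches, and eval_fn(model, lang) returns the validation accuracy used for the distillation check.

```python
import torch
import torch.nn.functional as F

def matrix_rank_loss(code_vecs, desc_vecs):
    """In-batch ranking loss of the same form as Loss(X) in Section 2.2."""
    sim = F.normalize(code_vecs, dim=1) @ F.normalize(desc_vecs, dim=1).t()
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    submax = sim.masked_fill(mask, float('-inf')).max(dim=1).values
    return ((1.0 - sim.diag()) + submax).mean()

def train_student(student, teachers, loaders, eval_fn, lam=0.8,
                  lr=0.1, t_total=10000, t_check=500, tau=0.0):
    """Sketch of Algorithm 1 with assumed model/loader/eval interfaces (see lead-in)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    use_teacher = {lang: True for lang in loaders}            # flag f_l per language
    for step in range(1, t_total + 1):
        opt.zero_grad()
        total = 0.0
        for lang, loader in loaders.items():
            code_ids, desc_ids = next(loader)                 # one batch for this language
            c_s, d_s = student.encode(code_ids, desc_ids)
            loss = matrix_rank_loss(c_s, d_s)                 # Loss_student_self
            if use_teacher[lang]:
                with torch.no_grad():
                    c_t, d_t = teachers[lang].encode(code_ids, desc_ids)
                kd = matrix_rank_loss(c_t, d_s) + matrix_rank_loss(c_s, d_t)  # Loss_KD terms
                loss = (1.0 - lam) * loss + lam * kd          # Loss_student_ALL
            total = total + loss
        total.backward()
        opt.step()
        if step % t_check == 0:                               # distillation inspection (lines 14-22)
            for lang in loaders:
                use_teacher[lang] = eval_fn(student, lang) < eval_fn(teachers[lang], lang) + tau
```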

3. Experiments

3.1. Data Preparation

The experimental data are taken from the publicly available dataset collected by Husain et al. [3]. They collected a corpus from publicly available open-source GitHub repositories; to weed out low-quality project code, libraries.io was used to identify all projects that are referenced by at least one other project, and projects were ranked by "popularity" as indicated by the number of stars and forks. Summary statistics of the dataset are listed in Table 1. The data set is divided into training, validation, and test sets in an 80:10:10 ratio.
However, the data obtained after corpus cleaning are still not ideal. First of all, function annotations are essentially different from query sentences, so the format of the language is not the same: code and annotations are often written by the same author at the same time and therefore tend to share the same vocabulary, unlike search queries, which cover many different terms. Secondly, despite our efforts at data cleaning, the extent to which each annotation accurately describes its associated code fragment remains uncertain. For example, some annotations are obsolete with respect to the code they describe, or the comment refers to a localized part that the author wanted to highlight rather than to the whole function. Finally, we are aware that some annotations are written in other languages such as Japanese and Russian, while our evaluation dataset focuses on English queries. To address these issues, some scholars add other conditional features to strengthen the characteristics of the samples: when collecting and organizing a corpus, they select previously available code and the corresponding descriptive annotations, and additionally collect related query information, for example by gathering questions from Stack Overflow and extracting the code and corresponding annotations from the answers. This makes it possible to propose models that exploit such extra information, but this type of data is not mainstream (most code has no associated query information; for example, a company's internal code has no related query questions), so this paper still experiments with the original dataset.

3.2. Vocabulary

For a fixed-size dictionary, the traditional tokenization technique of simply segmenting text by spaces and symbols has many drawbacks, such as the inability to handle unknown or rare words (the out-of-vocabulary, OOV, problem), the nature of a language's own lexical construction, and the difficulty of learning root-word associations, which leads to poor generalization. Byte-Pair Encoding (BPE) is a method for solving such issues. Unknown or rare words can be regarded as unregistered words, i.e., words that do not appear in the training corpus but appear in the test corpus. When working on NLP tasks, we usually generate a vocabulary (dictionary) from the corpus: words whose frequency exceeds a certain threshold are put into the dictionary, and all words below that threshold are encoded as "#UNK". The advantage of this approach is its simplicity, but the problem is that the model has difficulty handling unregistered words that appear in the test corpus. Usually, dictionaries are word-level, i.e., based on words or phrases, but this inevitably leads to unregistered words, because it is impossible to design a dictionary large enough to cover all words. Another type of lexicon is character-level, where single letters or Chinese characters are the basic units; this can in principle solve the unregistered-word problem, because all words are composed of characters, but the granularity is too fine and lacks semantic information. Sennrich et al. [24] proposed a subword-based approach that combines the advantages of the word level and the character level: high-frequency character substrings are learned from the corpus and merged into a lexicon that contains both word-level and character-level substrings. We used the BPE technique to generate both the query and the code vocabulary.
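As an illustration of the BPE idea (not the exact tool used in this paper), the following minimal sketch learns merge rules from word frequencies in the style of Sennrich et al. [24]: starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new subword.

```python
from collections import Counter

def learn_bpe(corpus_tokens, num_merges=10):
    """Minimal BPE sketch: learn merge rules from word frequencies."""
    vocab = Counter(tuple(word) for word in corpus_tokens)    # each word as a tuple of symbols (characters at first)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                      # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2  # merge the pair into one subword
                else:
                    out.append(word[i]); i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(learn_bpe(["bubble_sort", "quick_sort", "sorted", "sorting"], num_merges=5))
```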

3.3. Evaluation

Mean Reciprocal Rank (MRR)
MRR is commonly used as an evaluation metric in recommendation systems. It evaluates the performance of a retrieval system by the rank of the correct result in the returned list. The formula is as follows.
$MRR = \dfrac{1}{|Q|} \sum_{q=1}^{|Q|} \dfrac{1}{rank_q}$
SuccessRate@k
SuccessRate@k is a common metric that evaluates whether an approach can retrieve the correct answer in the top k returned results. It is widely used by many studies on the code search task. The metric is calculated as follows:
$SuccessRate@k = \dfrac{1}{|Q|} \sum_{q=1}^{|Q|} \delta(Rank_q \le k)$
where Q denotes the set of queries, $Rank_q$ denotes the highest rank of the hit snippets in the returned snippet list for query q, and $\delta(\cdot)$ denotes an indicator function that returns 1 if the rank of the qth query ($Rank_q$) is no greater than k and 0 otherwise. SuccessRate@k is important because a better code search engine should allow developers to find the desired snippet by inspecting fewer results.
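Both metrics are straightforward to compute from a similarity matrix; the sketch below is our own illustration, assuming the correct snippet for query i sits at column i of the matrix (as in the distractor-based evaluation with |Q| = 1000 described later).

```python
import numpy as np

def mrr_and_success(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix[i, j] = similarity(query i, code j); the correct code for query i is column i."""
    n = sim_matrix.shape[0]
    correct = sim_matrix[np.arange(n), np.arange(n)]
    ranks = 1 + (sim_matrix > correct[:, None]).sum(axis=1)   # rank = 1 + number of higher-scoring candidates
    mrr = float(np.mean(1.0 / ranks))
    success = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, success

# Toy usage with |Q| = 1000 candidates per query.
rng = np.random.default_rng(0)
sims = rng.normal(size=(1000, 1000))
sims[np.arange(1000), np.arange(1000)] += 2.0   # make correct pairs score higher on average
print(mrr_and_success(sims))
```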

3.4. Experiment Setup

3.4.1. Data Pre-Processing

First, we tokenize the code and descriptions of the training sets for all six languages and use the BPE method to construct a code dictionary and a description dictionary, each of size 30,000. Codes and descriptions in the dataset are then transformed into index sequences; the code length is set to 200 and the description length to 30, with padding applied if a sequence is shorter.
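A minimal sketch of this indexing and padding step is shown below; the toy vocabulary, pad id, and unknown-token id are illustrative assumptions rather than the exact values used in the paper.

```python
def to_index_sequence(tokens, vocab, max_len, pad_id=0, unk_id=1):
    """Map BPE tokens to ids, truncate to max_len, and right-pad (pad/unk ids are assumed)."""
    ids = [vocab.get(t, unk_id) for t in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Example: code sequences capped at 200 tokens, descriptions at 30, as described above.
vocab = {"def": 2, "bubble": 3, "sort": 4, "(": 5, ")": 6, ":": 7}
print(to_index_sequence(["def", "bubble", "sort", "(", ")", ":"], vocab, max_len=200)[:10])
```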

3.4.2. Teacher Model Training

We shuffle the training set; the batch size is set to 512, the epoch limit to 500, the optimization algorithm is Adam, and the learning rate is 0.1. An early-stopping mechanism is used during training, with a tolerance of patience = 5: if the model does not improve on the validation set for 5 consecutive epochs, training stops and the best-performing epoch's model is saved. After training on each language, six teacher models are obtained. In addition, in order to establish that the knowledge distillation method is indeed effective, we also train a model on a dataset that fuses the six languages together, shown in the ALL rows of Table 2 and Table 3.

3.4.3. Student Model Training

The network structure of the student model is consistent with that of the teacher model. The results of the teacher models’ encoder are used to guide the student model during the training process. The details are described in Algorithm 1 in the previous section.

3.4.4. Evaluation Setting

The evaluation methods for the teacher model and the student model were described in the previous section. The evaluation batch size for both MRR and SuccessRate@k is set to 1000, i.e., $|Q| = 1000$, with k = 1, 5, 10 for SuccessRate@k.

3.4.5. Lambda Parameter Exploration

To find out how much impact the teacher models have on the student model, the parameter $\lambda$ in the combined loss $Loss_{student\_ALL}$ was set to different values to explore the effect of different teacher-model weightings on the student model. As the results under different encoders show a similar pattern, we explore $\lambda$ only for the teacher and student models whose code and query encoders are both self-attention.

3.4.6. Experiment Equipment

The experiments use three RTX 1080Ti 11 GB GPUs; the training time per epoch is around 200–400 s, and a teacher or student model converges in around 20–40 epochs.

4. Experiment Results

We prepared several different combinations of code and query encoding methods, including the common 1dCNN, NBOW, and self-attention, and combined them into different models. We also prepared a test set for each language and tested each monolingual model and MPLCS on it; the MRR and SuccessRate@k results are shown in Table 2 and Table 3, respectively. The results indicate that each monolingual model predicts well only on its own language and poorly on the others (Ruby is the exception, because its training set is too small). This holds for all models. We can also observe similarities between programming languages from these tables, for example between Python and PHP. Notice that the ALL model, trained on the six datasets fused together, gives consistent results across languages; although its accuracy is not as high as that of a monolingual model on its own language, it is higher than that of a monolingual model on other languages. Our model outperforms ALL on almost every language, which confirms that the student model does indeed learn the teacher models' knowledge through knowledge distillation. Notice also that MPLCS is superior to the teacher models on the Ruby and JavaScript test sets; this is because the training sets for both languages are relatively small, and the multilingual fusion model can compensate for the small training set to some extent.
The effect of $\lambda$ on the student model can be seen in Table 4. As $\lambda$ increases from 0, the student model receives more guidance from the teacher models, which leads to a gradual increase in MRR across all languages. When $\lambda = 0.8$, the average MRR over the six languages reaches its maximum, relatively close to the MRR at $\lambda = 0.9$ and $\lambda = 1.0$. The results indicate that the more guidance the student model receives from the teacher models, the more knowledge it learns.

5. Conclusions

In this paper we present a new take on semantic code retrieval: multi-programming-language code retrieval. By introducing the knowledge distillation technique, we build a Multi-Programming Language Code Search (MPLCS) model. The model fuses several monolingual teacher models into a single student model; it supports multi-language code retrieval and also compensates for languages whose training sets are too small. In addition, MPLCS places no restrictions on the encoding method and can be applied with a variety of encoders. This paper only applies a general knowledge distillation technique and uses only the outputs of the teacher models' encoders, so the accuracy improvement is modest. Nevertheless, we believe this work can have an intriguing effect on multi-programming-language code retrieval tasks.
Open Questions
  • In this paper, only the simplest features of the code are used, treating it as a new natural language; other features such as API sequences and information from AST trees were not used, and further research on these features could improve accuracy.
  • As mentioned before, a high-quality training set can also greatly improve the practical meaning of the conclusions.
  • Translation between different programming languages is also a very interesting research direction.
  • Multi-natural language to multi-programming language is also a valuable research direction, but it will require a more comprehensive dataset as support.

Author Contributions

Conceptualization, W.L.; Data curation, W.L.; Formal analysis, W.L. and J.X.; Funding acquisition, Q.C.; Investigation, W.L. and J.X.; Methodology, W.L.; Project administration, Q.C.; Resources, W.L.; Software, W.L.; Supervision, J.X. and Q.C.; Validation, W.L. and J.X.; Visualization, W.L.; Writing—original draft, W.L.; Writing—review & editing, W.L. and J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the National Key Research and Development Program of China, grant number 2018YFB2101200.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip; the programming language name in the file name (here, python) can be replaced with another language name, such as java.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Luan, S.; Yang, D.; Barnaby, C.; Sen, K.; Chandra, S. Aroma: Code recommendation via structural code search. Proc. ACM Program. Lang. 2019, 3, 1–28. [Google Scholar] [CrossRef] [Green Version]
  2. Lv, F.; Zhang, H.; Lou, J.g.; Wang, S.; Zhang, D.; Zhao, J. Codehow: Effective code search based on api understanding and extended boolean model (e). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 260–270. [Google Scholar]
  3. Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv 2019, arXiv:1909.09436. [Google Scholar]
  4. Gu, X.; Zhang, H.; Kim, S. Deep code search. In Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), Gothenburg, Sweden, 27 May–3 June 2018; pp. 933–944. [Google Scholar]
  5. Haldar, R.; Wu, L.; Xiong, J.; Hockenmaier, J. A multi-perspective architecture for semantic code search. arXiv 2020, arXiv:2005.06980. [Google Scholar]
  6. Sachdev, S.; Li, H.; Luan, S.; Kim, S.; Sen, K.; Chandra, S. Retrieval on source code: A neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, PA, USA, 18 June 2018; pp. 31–41. [Google Scholar]
  7. Cambronero, J.; Li, H.; Kim, S.; Sen, K.; Chandra, S. When deep learning met code search. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; pp. 964–974. [Google Scholar]
  8. Yin, P.; Neubig, G. A syntactic neural model for general-purpose code generation. arXiv 2017, arXiv:1704.01696. [Google Scholar]
  9. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. Codebert: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155. [Google Scholar]
  10. Kanade, A.; Maniatis, P.; Balakrishnan, G.; Shi, K. Learning and evaluating contextual embedding of source code. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 5110–5121. [Google Scholar]
  11. Mayer, P.; Bauer, A. An empirical analysis of the utilization of multiple programming languages in open source projects. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, Nanjing, China, 27–29 April 2015; pp. 1–10. [Google Scholar]
  12. Mayer, P.; Kirsch, M.; Le, M.A. On multi-language software development, cross-language links and accompanying tools: A survey of professional software developers. J. Softw. Eng. Res. Dev. 2017, 5, 1–33. [Google Scholar] [CrossRef]
  13. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  14. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef] [Green Version]
  15. Firat, O.; Cho, K.; Bengio, Y. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv 2016, arXiv:1601.01073. [Google Scholar]
  16. Ha, T.L.; Niehues, J.; Waibel, A. Toward multilingual neural machine translation with universal encoder and decoder. arXiv 2016, arXiv:1611.04798. [Google Scholar]
  17. Lu, Y.; Keung, P.; Ladhak, F.; Bhardwaj, V.; Zhang, S.; Sun, J. A neural interlingua for multilingual machine translation. arXiv 2018, arXiv:1804.08198. [Google Scholar]
  18. Tan, X.; Ren, Y.; He, D.; Qin, T.; Zhao, Z.; Liu, T.Y. Multilingual neural machine translation with knowledge distillation. arXiv 2019, arXiv:1902.10461. [Google Scholar]
  19. Xu, R.; Xiong, C.; Chen, W.; Corso, J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
  20. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  21. Wan, Y.; Shu, J.; Sui, Y.; Xu, G.; Zhao, Z.; Wu, J.; Yu, P.S. Multi-modal attention network learning for semantic source code retrieval. arXiv 2019, arXiv:1909.13516. [Google Scholar]
  22. Zeng, C.; Yu, Y.; Li, S.; Xia, X.; Wang, Z.; Geng, M.; Xiao, B.; Dong, W.; Liao, X. deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search. arXiv 2021, arXiv:2103.13020. [Google Scholar]
  23. Gu, J.; Chen, Z.; Monperrus, M. Multimodal Representation for Neural Code Search. In Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), Luxembourg, 27 September–1 October 2021; pp. 483–494. [Google Scholar]
  24. Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
Figure 1. Conceptual diagram of joint embedding in a code search task.
Figure 2. Details in Teacher Models whose encoder is self-attention.
Figure 3. Schematic diagram of the overall model.
Table 1. Sample size.
Language | Number of Functions
Java | 542,991
Go | 347,789
PHP | 717,313
Python | 503,502
JavaScript | 157,988
Ruby | 57,393
Total | 2,326,976
Table 2. MRR for each monolingual model and MPLCS model on different language test sets.
Code Encoder | Query Encoder | Model | Go | Java | JavaScript | PHP | Python | Ruby
SELF-ATT | SELF-ATT | Go | 0.7756 | 0.5400 | 0.4591 | 0.4552 | 0.5649 | 0.4760
 | | Java | 0.6485 | 0.6632 | 0.4806 | 0.5390 | 0.6047 | 0.5157
 | | JavaScript | 0.5688 | 0.5187 | 0.5304 | 0.4494 | 0.5719 | 0.4816
 | | PHP | 0.6432 | 0.6005 | 0.5068 | 0.6424 | 0.6915 | 0.5572
 | | Python | 0.6397 | 0.5691 | 0.4968 | 0.5602 | 0.7613 | 0.5791
 | | Ruby | 0.4849 | 0.4319 | 0.3494 | 0.3673 | 0.5167 | 0.4773
 | | ALL | 0.7356 | 0.6350 | 0.5240 | 0.6191 | 0.7177 | 0.5717
 | | MPLCS | 0.7472 | 0.6404 | 0.5492 | 0.6079 | 0.7289 | 0.5977
CNN | CNN | Go | 0.7780 | 0.5593 | 0.4767 | 0.4872 | 0.6002 | 0.4943
 | | Java | 0.6691 | 0.6776 | 0.5106 | 0.5644 | 0.6446 | 0.5381
 | | JavaScript | 0.6038 | 0.5510 | 0.5546 | 0.4822 | 0.6148 | 0.5122
 | | PHP | 0.6718 | 0.6181 | 0.5257 | 0.6539 | 0.7152 | 0.5707
 | | Python | 0.6783 | 0.5994 | 0.5140 | 0.5658 | 0.7748 | 0.5918
 | | Ruby | 0.5628 | 0.4738 | 0.3937 | 0.4208 | 0.5802 | 0.5239
 | | ALL | 0.7405 | 0.6458 | 0.5363 | 0.6268 | 0.7301 | 0.5805
 | | MPLCS | 0.7451 | 0.6531 | 0.5656 | 0.6217 | 0.7457 | 0.6111
NBOW | NBOW | Go | 0.6777 | 0.5181 | 0.4256 | 0.4081 | 0.5280 | 0.4420
 | | Java | 0.5408 | 0.5981 | 0.4354 | 0.4456 | 0.5414 | 0.4590
 | | JavaScript | 0.5312 | 0.4844 | 0.4799 | 0.4118 | 0.5031 | 0.4087
 | | PHP | 0.5645 | 0.5359 | 0.4442 | 0.5569 | 0.5720 | 0.4727
 | | Python | 0.5746 | 0.5212 | 0.4334 | 0.4499 | 0.6560 | 0.4987
 | | Ruby | 0.4645 | 0.4248 | 0.3465 | 0.3465 | 0.4977 | 0.4539
 | | ALL | 0.6466 | 0.5660 | 0.4602 | 0.5251 | 0.6117 | 0.4911
 | | MPLCS | 0.6710 | 0.5882 | 0.5024 | 0.5283 | 0.6389 | 0.5369
SELF-ATT | NBOW | Go | 0.7599 | 0.5315 | 0.4594 | 0.4379 | 0.5549 | 0.4647
 | | Java | 0.6392 | 0.6571 | 0.4849 | 0.5423 | 0.6021 | 0.5110
 | | JavaScript | 0.5754 | 0.5136 | 0.5354 | 0.4740 | 0.5600 | 0.4687
 | | PHP | 0.6391 | 0.5899 | 0.4956 | 0.6424 | 0.6678 | 0.5436
 | | Python | 0.6340 | 0.5646 | 0.4944 | 0.5473 | 0.7563 | 0.5646
 | | Ruby | 0.4840 | 0.4202 | 0.3476 | 0.3446 | 0.5072 | 0.4704
 | | ALL | 0.7266 | 0.6291 | 0.5271 | 0.6175 | 0.7127 | 0.5661
 | | MPLCS | 0.7403 | 0.6429 | 0.5604 | 0.6148 | 0.7356 | 0.6022
Table 3. SuccessRate@k for each monolingual model and MPLCS model on different language test sets.
SuccessRate@k | Code Encoder | Query Encoder | Model | Go | Java | JavaScript | PHP | Python | Ruby
SuccessRate@1 | SELF-ATT | SELF-ATT | Go | 0.7233 | 0.4535 | 0.3750 | 0.3615 | 0.4645 | 0.3740
 | | | Java | 0.5629 | 0.5859 | 0.3980 | 0.4496 | 0.5064 | 0.4150
 | | | JavaScript | 0.4828 | 0.4329 | 0.4400 | 0.3614 | 0.4715 | 0.3755
 | | | PHP | 0.5630 | 0.5208 | 0.4260 | 0.5645 | 0.6017 | 0.4555
 | | | Python | 0.5594 | 0.4851 | 0.4158 | 0.4758 | 0.6781 | 0.4785
 | | | Ruby | 0.3886 | 0.3428 | 0.2648 | 0.2787 | 0.4156 | 0.3710
 | | | ALL | 0.6709 | 0.5542 | 0.4362 | 0.5362 | 0.6255 | 0.4700
 | | | MPLCS | 0.6800 | 0.5643 | 0.4670 | 0.5295 | 0.6434 | 0.5035
 | CNN | CNN | Go | 0.7238 | 0.4730 | 0.3891 | 0.3889 | 0.4965 | 0.3830
 | | | Java | 0.5862 | 0.5998 | 0.4229 | 0.4757 | 0.5453 | 0.4365
 | | | JavaScript | 0.5154 | 0.4616 | 0.4627 | 0.3925 | 0.5120 | 0.4045
 | | | PHP | 0.5933 | 0.5389 | 0.4434 | 0.5743 | 0.6240 | 0.4650
 | | | Python | 0.6006 | 0.5144 | 0.4301 | 0.4782 | 0.6915 | 0.4890
 | | | Ruby | 0.4671 | 0.3840 | 0.3032 | 0.3293 | 0.4794 | 0.4210
 | | | ALL | 0.6758 | 0.5651 | 0.4461 | 0.5431 | 0.6376 | 0.4790
 | | | MPLCS | 0.6801 | 0.5685 | 0.4723 | 0.5331 | 0.6510 | 0.5075
 | NBOW | NBOW | Go | 0.5934 | 0.4268 | 0.3320 | 0.3156 | 0.4254 | 0.3385
 | | | Java | 0.4399 | 0.5072 | 0.3412 | 0.3480 | 0.4392 | 0.3520
 | | | JavaScript | 0.4359 | 0.3929 | 0.3825 | 0.3177 | 0.4018 | 0.3055
 | | | PHP | 0.4694 | 0.4453 | 0.3528 | 0.4630 | 0.4686 | 0.3680
 | | | Python | 0.4796 | 0.4297 | 0.3397 | 0.3537 | 0.5537 | 0.3945
 | | | Ruby | 0.3673 | 0.3347 | 0.2592 | 0.2577 | 0.3955 | 0.3465
 | | | ALL | 0.5586 | 0.4728 | 0.3627 | 0.4286 | 0.5067 | 0.3825
 | | | MPLCS | 0.5849 | 0.4935 | 0.3973 | 0.4287 | 0.5312 | 0.4260
 | SELF-ATT | NBOW | Go | 0.7021 | 0.4425 | 0.3718 | 0.3429 | 0.4530 | 0.3620
 | | | Java | 0.5532 | 0.5791 | 0.3992 | 0.4543 | 0.5011 | 0.4015
 | | | JavaScript | 0.4866 | 0.4262 | 0.4445 | 0.3853 | 0.4566 | 0.3595
 | | | PHP | 0.5588 | 0.5095 | 0.4128 | 0.5645 | 0.5745 | 0.4430
 | | | Python | 0.5529 | 0.4792 | 0.4103 | 0.4589 | 0.6713 | 0.4630
 | | | Ruby | 0.3879 | 0.3312 | 0.2630 | 0.2560 | 0.4072 | 0.3635
 | | | ALL | 0.6582 | 0.5473 | 0.4378 | 0.5331 | 0.6201 | 0.4625
 | | | MPLCS | 0.6735 | 0.5600 | 0.4703 | 0.5271 | 0.6415 | 0.4945
SuccessRate@5 | SELF-ATT | SELF-ATT | Go | 0.8302 | 0.6418 | 0.5535 | 0.5626 | 0.6850 | 0.5980
 | | | Java | 0.7491 | 0.7567 | 0.5762 | 0.6450 | 0.7215 | 0.6335
 | | | JavaScript | 0.6710 | 0.6180 | 0.6370 | 0.5499 | 0.6898 | 0.6075
 | | | PHP | 0.7388 | 0.6947 | 0.6015 | 0.7356 | 0.8002 | 0.6770
 | | | Python | 0.7373 | 0.6684 | 0.5870 | 0.6612 | 0.8639 | 0.7000
 | | | Ruby | 0.5925 | 0.5303 | 0.4407 | 0.4661 | 0.6343 | 0.6070
 | | | ALL | 0.8066 | 0.7327 | 0.6245 | 0.7179 | 0.8319 | 0.6855
 | | | MPLCS | 0.8171 | 0.7441 | 0.6617 | 0.7214 | 0.8524 | 0.7290
 | CNN | CNN | Go | 0.8354 | 0.6617 | 0.5778 | 0.6041 | 0.7253 | 0.6290
 | | | Java | 0.7685 | 0.7715 | 0.6113 | 0.6696 | 0.7648 | 0.6600
 | | | JavaScript | 0.7083 | 0.6562 | 0.6643 | 0.5854 | 0.7383 | 0.6380
 | | | PHP | 0.7634 | 0.7115 | 0.6189 | 0.7493 | 0.8279 | 0.6925
 | | | Python | 0.7701 | 0.7015 | 0.6123 | 0.6709 | 0.8776 | 0.7175
 | | | Ruby | 0.6729 | 0.5771 | 0.4991 | 0.5229 | 0.6996 | 0.6490
 | | | ALL | 0.8119 | 0.7431 | 0.6411 | 0.7284 | 0.8455 | 0.7030
 | | | MPLCS | 0.8179 | 0.7558 | 0.6778 | 0.7306 | 0.8656 | 0.7450
 | NBOW | NBOW | Go | 0.7729 | 0.6247 | 0.5358 | 0.5136 | 0.6476 | 0.5650
 | | | Java | 0.6561 | 0.7063 | 0.5423 | 0.5581 | 0.6604 | 0.5835
 | | | JavaScript | 0.6399 | 0.5898 | 0.5920 | 0.5190 | 0.6190 | 0.5195
 | | | PHP | 0.6744 | 0.6413 | 0.5502 | 0.6677 | 0.6954 | 0.5930
 | | | Python | 0.6833 | 0.6293 | 0.5350 | 0.5632 | 0.7815 | 0.6220
 | | | Ruby | 0.5723 | 0.5263 | 0.4403 | 0.4442 | 0.6150 | 0.5865
 | | | ALL | 0.7492 | 0.6773 | 0.5770 | 0.6401 | 0.7380 | 0.6200
 | | | MPLCS | 0.7713 | 0.7048 | 0.6287 | 0.6475 | 0.7731 | 0.6770
 | SELF-ATT | NBOW | Go | 0.8213 | 0.6353 | 0.5588 | 0.5489 | 0.6736 | 0.5850
 | | | Java | 0.7391 | 0.7491 | 0.5810 | 0.6450 | 0.7203 | 0.6365
 | | | JavaScript | 0.6774 | 0.6150 | 0.6402 | 0.5764 | 0.6821 | 0.5915
 | | | PHP | 0.7329 | 0.6840 | 0.5893 | 0.7356 | 0.7801 | 0.6625
 | | | Python | 0.7282 | 0.6636 | 0.5912 | 0.6523 | 0.8620 | 0.6865
 | | | Ruby | 0.5912 | 0.5210 | 0.4362 | 0.4408 | 0.6221 | 0.5995
 | | | ALL | 0.8038 | 0.7269 | 0.6285 | 0.7186 | 0.8290 | 0.6905
 | | | MPLCS | 0.8163 | 0.7438 | 0.6667 | 0.7209 | 0.8537 | 0.7325
SuccessRate@10 | SELF-ATT | SELF-ATT | Go | 0.8581 | 0.6970 | 0.6100 | 0.6301 | 0.7476 | 0.6640
 | | | Java | 0.7936 | 0.7998 | 0.6325 | 0.6995 | 0.7826 | 0.6945
 | | | JavaScript | 0.7229 | 0.6749 | 0.6967 | 0.6129 | 0.7563 | 0.6755
 | | | PHP | 0.7831 | 0.7405 | 0.6550 | 0.7759 | 0.8486 | 0.7355
 | | | Python | 0.7809 | 0.7215 | 0.6475 | 0.7154 | 0.9030 | 0.7510
 | | | Ruby | 0.6611 | 0.5995 | 0.5125 | 0.5355 | 0.7058 | 0.6785
 | | | ALL | 0.8399 | 0.7777 | 0.6832 | 0.7636 | 0.8755 | 0.7455
 | | | MPLCS | 0.8494 | 0.7910 | 0.7287 | 0.7703 | 0.8969 | 0.7930
 | CNN | CNN | Go | 0.8622 | 0.7146 | 0.6318 | 0.6672 | 0.7891 | 0.6945
 | | | Java | 0.8104 | 0.8131 | 0.6694 | 0.7236 | 0.8215 | 0.7260
 | | | JavaScript | 0.7626 | 0.7114 | 0.7203 | 0.6492 | 0.7987 | 0.7020
 | | | PHP | 0.8054 | 0.7581 | 0.6771 | 0.7890 | 0.8722 | 0.7620
 | | | Python | 0.8101 | 0.7507 | 0.6708 | 0.7234 | 0.9155 | 0.7710
 | | | Ruby | 0.7373 | 0.6406 | 0.5624 | 0.5895 | 0.7678 | 0.7240
 | | | ALL | 0.8447 | 0.7898 | 0.7008 | 0.7741 | 0.8906 | 0.7640
 | | | MPLCS | 0.8523 | 0.8067 | 0.7410 | 0.7803 | 0.9117 | 0.8040
 | NBOW | NBOW | Go | 0.8208 | 0.6841 | 0.6033 | 0.5832 | 0.7196 | 0.6280
 | | | Java | 0.7253 | 0.7624 | 0.6077 | 0.6280 | 0.7320 | 0.6595
 | | | JavaScript | 0.7025 | 0.6546 | 0.6593 | 0.5884 | 0.6902 | 0.5965
 | | | PHP | 0.7349 | 0.7003 | 0.6165 | 0.7291 | 0.7622 | 0.6660
 | | | Python | 0.7449 | 0.6900 | 0.6068 | 0.6321 | 0.8422 | 0.6920
 | | | Ruby | 0.6448 | 0.5913 | 0.5105 | 0.5153 | 0.6894 | 0.6565
 | | | ALL | 0.8002 | 0.7361 | 0.6453 | 0.7034 | 0.8030 | 0.7045
 | | | MPLCS | 0.8217 | 0.7626 | 0.6975 | 0.7156 | 0.8397 | 0.7455
 | SELF-ATT | NBOW | Go | 0.8508 | 0.6919 | 0.6198 | 0.6176 | 0.7395 | 0.6670
 | | | Java | 0.7896 | 0.7934 | 0.6372 | 0.7000 | 0.7831 | 0.7010
 | | | JavaScript | 0.7328 | 0.6732 | 0.7017 | 0.6421 | 0.7465 | 0.6570
 | | | PHP | 0.7789 | 0.7322 | 0.6457 | 0.7755 | 0.8319 | 0.7205
 | | | Python | 0.7779 | 0.7183 | 0.6488 | 0.7069 | 0.8994 | 0.7480
 | | | Ruby | 0.6594 | 0.5858 | 0.5063 | 0.5126 | 0.6946 | 0.6775
 | | | ALL | 0.8367 | 0.7745 | 0.6892 | 0.7637 | 0.8750 | 0.7465
 | | | MPLCS | 0.8482 | 0.7918 | 0.7312 | 0.7715 | 0.9011 | 0.7865
Table 4. The effect of λ on the student model.
Lambda | Go | Java | JavaScript | PHP | Python | Ruby | Avg
0.0 | 0.7304 | 0.6365 | 0.5520 | 0.6041 | 0.7208 | 0.5925 | 0.6535
0.1 | 0.7386 | 0.6415 | 0.5567 | 0.6086 | 0.7274 | 0.5957 | 0.6591
0.2 | 0.7385 | 0.6444 | 0.5594 | 0.6084 | 0.7320 | 0.5980 | 0.6611
0.3 | 0.7345 | 0.6443 | 0.5602 | 0.6130 | 0.7362 | 0.6005 | 0.6628
0.4 | 0.7393 | 0.6460 | 0.5617 | 0.6138 | 0.7354 | 0.6033 | 0.6642
0.5 | 0.7394 | 0.6469 | 0.5602 | 0.6184 | 0.7400 | 0.6089 | 0.6668
0.6 | 0.7386 | 0.6471 | 0.5579 | 0.6165 | 0.7384 | 0.6024 | 0.6655
0.7 | 0.7470 | 0.6469 | 0.5563 | 0.6193 | 0.7364 | 0.6013 | 0.6669
0.8 | 0.7400 | 0.6473 | 0.5621 | 0.6191 | 0.7391 | 0.6059 | 0.6670
0.9 | 0.7421 | 0.6445 | 0.5533 | 0.6156 | 0.7357 | 0.6046 | 0.6642
1.0 | 0.7453 | 0.6456 | 0.5561 | 0.6155 | 0.7356 | 0.6042 | 0.6651

