K-EPIC: Entity-Perceived Context Representation in Korean Relation Extraction

Hur, Yuna; Son, Suhyune; Shim, Midan; Lim, Jungwoo; Lim, Heuiseok

doi:10.3390/app112311472

¹ Department of Computer Science and Engineering, Korea University, 145, Anam-ro, Seongbuk-gu, Seoul 02841, Korea

² Human-Inspired AI Research, Korea University, 145, Anam-ro, Seongbuk-gu, Seoul 02841, Korea

³ Department of Software Convergence, Kyung Hee University, Yongin 17104, Korea

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2021, 11(23), 11472; https://doi.org/10.3390/app112311472

Submission received: 4 November 2021 / Revised: 24 November 2021 / Accepted: 25 November 2021 / Published: 3 December 2021

Abstract

Relation Extraction (RE) aims to predict the correct relation between two entities in a given sentence. To obtain the proper relation, it is important to comprehend the precise meaning of the two entities as well as the context of the sentence. In contrast to RE research in English, Korean RE studies that focus on the entities and preserve Korean linguistic properties are rare. Therefore, we propose K-EPIC (Entity-Perceived Context representation in Korean) to enhance the capability of understanding the meaning of entities while considering the linguistic characteristics of Korean. We present experimental results on the BERT-Ko-RE and KLUE-RE datasets with four different types of K-EPIC methods, utilizing entity position tokens. To compare the ability of Korean pre-trained language models to understand entities and context, we analyze HanBERT, KLUE-BERT, KoBERT, KorBERT, KoELECTRA, and multilingual-BERT (mBERT). The experimental results demonstrate that the F1 score increases significantly with our K-EPIC and that the language models trained on a Korean corpus outperform the baseline.

Keywords:

information extraction; relation extraction; Korean pre-trained language model; deep learning

1. Introduction

Research on automatic information extraction has recently grown in importance [1] with the advent of massive unstructured documents. Information Extraction (IE), which provides the basis for extracting structured information from unstructured resources, is considered a promising research area in natural language processing (NLP). Among the principal research topics in IE [2], Relation Extraction (RE) aims to predict the relation between two entities in a single sentence. RE is a significant task, especially in the field of Knowledge Base Population (KBP), since it extracts structured triples. Additionally, it is used in advanced research, including Question Answering (QA) systems, Summarization, Dialogue Systems, and Information Retrieval (IR) [3].

To obtain the final prediction in the form (entity1, relation, entity2) in RE tasks, it is important to determine the precise meaning of the two entities along with the context of the sentence [2]. For example, as shown in Figure 1, it is much easier to predict the relation “org:place_of_headquarters” if a person already knows the meaning of the subject entity “한국방송공사에서 (Korea Broadcasting Corporation)”, marked in orange, and the object entity “대한민국의 (South Korea)”, marked in purple. In recent studies, pre-trained language models trained on a large corpus to capture contextual information, such as BERT [4], show considerable performance in various NLP tasks including RE. Among the ways of utilizing pre-trained language models in RE tasks, one represents entities by aligning the sentence and the entities separated with a [SEP] token [5], and another replaces entities with the corresponding Named Entity Recognition (NER) tags [6]. The former implicitly represents the sentences without any entity-specific tokens, and the latter merely replaces the two entities with specific tokens, such as <entity 1>, so that their meaning is lost. To overcome these problems, Soares et al. [7] utilize an explicit representation of entities and predict the relation on English RE datasets [8,9].

Following the previous RE research in English, BERT-Ko-RE [10] and KLUE-RE [11], which are recently published Korean RE datasets, elevate the level of research in Korean. In addition, Nam et al. [10] demonstrate the performance of multilingual-BERT (mBERT) [4] trained with a corpus in 104 languages. However, few RE studies consider two linguistic properties of Korean that differ from English. One is that the role of a word in most Korean sentences is decided only when the postposition is fully combined [12]. For instance, by combining ‘총리 (root word; the prime minister)’ and ‘는 (postposition)’, the word ‘총리는’ finally acts as the subject. Therefore, it is important to extract the comprehensive meaning of the word, which is a combined form of root word and postposition. The other property is the free word order in Korean sentences. Contrary to English, the order of words is not a substantial obstacle to comprehending the meaning of the sentence, since the role of a phrase or word is decided according to the type of postposition [13]. Owing to these characteristics, language models that have not been trained with a Korean corpus exhibit limited performance on Korean RE.

In this paper, we propose an entity-perceived context representation applied to Korean language models. To apply our K-EPIC method to the Korean pre-trained language models, the entities of the dataset are marked with entity position tokens. Since we aim to predict relations by capturing the meaning of entities while considering the linguistic properties of Korean, we conduct our experiments on the BERT-Ko-RE dataset [10] and the KLUE-RE dataset [11]. We conduct experiments with Korean language models, including HanBERT (https://github.com/monologg/HanBert-Transformers (accessed on 1 March 2021)), KLUE-BERT [11], KoBERT (https://github.com/SKTBrain/KoBERT (accessed on 1 March 2021)), KorBERT (https://aiopen.etri.re.kr/service_dataset.php (accessed on 1 March 2021)), and KoELECTRA [14] to enhance their capability of understanding context and entity representations simultaneously. By utilizing language models trained on a Korean corpus, our model exhibits better performance than previous methods while preserving the linguistic characteristics of Korean in RE tasks. We also analyze the results of mBERT to empirically compare its capability of comprehending Korean. To the best of our knowledge, this is the first work that proposes an entity-aware method in RE achieving improved performance on five Korean language models.

The contributions of our work are summarized as follows:

We propose K-EPIC, four different ways of representing entities in Korean Relation Extraction.
We apply our K-EPIC method to the five Korean Pre-trained Language Models (PLMs) and analyze each result empirically.
Significant improvements are shown in the experiments when we apply our K-EPIC method to the Korean PLMs.

2. Related Works

2.1. Relation Extraction

Relation Extraction (RE) is the task of obtaining a triple (entity1, relation, entity2) by predicting the relation when a single sentence and two entities are given. In English, feature-based methods [15,16,17] that do not use pre-trained language models classify the relation using the Support Vector Machine (SVM) algorithm with information from Part-of-Speech (POS) tagging or syntactic parse trees. Lin et al. [18] predict relations by employing word embeddings and position embeddings passed through a Convolutional Neural Network (CNN). Soares et al. [7], the first pre-trained model-based approach, utilize special tokens to manage entity representations and predict the relation from the sentence. Similarly, Zhou and Chen [6] exploit special tokens to mark entities and show the impact of Named Entity Recognition (NER) features on relation representation. Moreover, this research describes two ways of utilizing special tokens.

In previous Korean RE studies, Kim and Lee [19] extract the entities from the sentence using Part-of-Speech (POS) tagging, Named Entity Recognition, and word embeddings. Then, using the result of dependency parsing on the constructed data, Bayesian probability is applied to predict the final relation. Kim et al. [20] directly construct a Korean text dataset about history and predict the relation using a Long Short-Term Memory (LSTM) model.

However, few studies have applied Korean language model-based methods to the BERT-Ko-RE and KLUE-RE datasets. Therefore, we propose the K-EPIC method, which enables us to analyze the impact of entity position tokens in diverse ways on RE tasks in Korean.

2.2. Pre-Trained Models

Pre-trained models trained on large corpora have recently shown considerable performance in diverse Natural Language Processing (NLP) tasks. As a leading example of pre-trained language models, BERT [4] uses a Transformer encoder [21] trained on 3.3 billion tokens with a random masking strategy. Various models based on BERT have achieved state-of-the-art results in many downstream tasks, and BERT is still used as a baseline for performance comparison. ELECTRA [22] proposes a more efficient pre-training task consisting of a generator and a discriminator network, a structure similar to a Generative Adversarial Network (GAN). ELECTRA is faster and more efficient since it learns diverse features from the input tokens.

Table 1 presents detailed information on the Korean pre-trained language models, including multilingual-BERT. HanBERT uses its own tokenizer (Moran) with 54,000 vocabulary words from 70 GB of Korean general documents and patent documents. KLUE-BERT is the pre-trained language model that is open to the public along with the KLUE dataset. KLUE-BERT is trained with a morpheme-based subword tokenizer, 32,000 vocabulary words from [23], and 63 GB of sentences including Modu corpus (https://corpus.korean.go.kr/ (accessed on 1 March 2021)), CC-100 (http://data.statmt.org/cc-100/ (accessed on 1 March 2021)), NamuWiki (https://namu.wiki/ (accessed on 1 March 2021)), newspaper, and other web sources. KoBERT is trained on 5 M sentences from Korean Wikipedia (https://ko.wikipedia.org/ (accessed on 1 March 2021)) and uses SentencePiece [24] as its tokenizer with 8002 vocabulary words. Additionally, KorBERT is trained with 23 GB of text from Korean news and an encyclopedia. KorBERT also provides two different versions of tokenizers: a wordpiece-based tokenizer (30,797 vocabulary words) and a morpheme-based tokenizer (30,349 vocabulary words). KoELECTRA is proposed to improve on previous Korean pre-trained language models with 32,200 vocabulary words. KoELECTRA uses a WordPiece tokenizer and is pre-trained with 34 GB of data from Korean Wikipedia, NamuWiki, newspaper, and Modu corpus.

2.3. Tokenizer in Korean Pre-Trained Language Models

Unlike English tokenization as in BERT, Korean requires a tokenization strategy that reflects its linguistic characteristics. In Korean, a word generally contains more information than a word in English, since a Korean word is a combined form of the root word and a postposition [25]. Postpositions include 은(eun), 는(neun), 이(i), 가(ga), 를(leul), and 의(ui). Because a word in a Korean sentence conveys its meaning once the postposition is combined, the order of words has a relatively low impact on understanding the sentence. Due to these characteristics, it is important in Korean tokenization to understand the word comprehensively while capturing the morphological relationship between the root word and the postposition. Accordingly, HanBERT, KLUE-BERT, and KorBERT properly tokenize the sentence as depicted in Table 2. Another way of understanding the word comprehensively without a morpheme-based tokenizer is to train the language model on Korean only. For instance, mBERT and KoELECTRA tokenize sentences into subword units differently even though they use the same WordPiece tokenizer strategy. Due to the difference in pre-training data, mBERT divides ‘총리는’ into ‘총’ and ‘##리는’, failing to distinguish the root word from the postposition. On the other hand, KoELECTRA correctly splits the word ‘총리는’ into the root word ‘총리’ and the postposition ‘##는’. While the mBERT tokenizer is trained with data in 104 languages, including Korean, the KoELECTRA tokenizer is trained on Korean Wikipedia, NamuWiki, and other Korean documents focusing on the Korean language. As a result, the KoELECTRA tokenizer reflects the characteristics of Korean better than that of mBERT. Likewise, HanBERT, KLUE-BERT, KoBERT, and KorBERT, which are trained with a Korean corpus, separate the word ‘총리는’ into the root word ‘총리’ and the postposition ‘##는’.
Therefore, it is important to apply a morpheme-based tokenizer or utilize a Korean pre-trained language model to process the unique properties of Korean. To this end, we apply a morpheme-based tokenizer to our K-EPIC method with Korean pre-trained language models.
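
The difference between the two splits of ‘총리는’ can be made concrete with a toy sketch. The function below is purely illustrative and is not the tokenizer of any model discussed here: it assumes a naive suffix match against the six postpositions listed above, whereas a real morpheme-based tokenizer relies on a morphological analyzer and a learned vocabulary. The names `POSTPOSITIONS` and `split_postposition` are hypothetical.

```python
from typing import List

# The six postpositions named in the text; a real analyzer covers many more.
POSTPOSITIONS = ("은", "는", "이", "가", "를", "의")


def split_postposition(word: str) -> List[str]:
    """Split a word into [root, '##' + postposition] by naive suffix matching.

    Words that do not end in a listed postposition, or that consist only of
    one, are returned unchanged as a single token.
    """
    for postposition in POSTPOSITIONS:
        if len(word) > len(postposition) and word.endswith(postposition):
            root = word[: -len(postposition)]
            return [root, "##" + postposition]
    return [word]


print(split_postposition("총리는"))  # ['총리', '##는']
```

Because the match is purely on the final syllable, this sketch would also split nouns that merely happen to end in one of these syllables, which is one reason the models above depend on morphological analysis or Korean-only training data instead.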

3. Proposed Method

We propose a K-EPIC method for Korean RE tasks as depicted in Figure 2. We preprocess the datasets to indicate the position of entities and analyze four different methods of utilizing the entities.

3.1. Task Definition

Consider the dataset $D = \{(S_{0}, e1_{0}, e2_{0}, r), \dots, (S_{N}, e1_{N}, e2_{N}, r)\}$, where N is the number of examples in the dataset. Each sentence $S_{n}$ is labeled with two entities $e1$ and $e2$, and a relation $r \in \{r_{0}, \dots, r_{R}\}$, where R is the total number of relations. The sentence can be denoted as $S_{n} = [t_{1}^{n}, t_{2}^{n}, \dots, t_{m}^{n}]$. $t_{i}^{n}$ is the i-th token of $S_{n}$, and m denotes the number of tokens from the tokenizer. Moreover, $t_{1}^{n}$ and $t_{m}^{n}$ are the [CLS] token and the [SEP] token, respectively. The entity position tokens, [$e1_{SP}$], [$e1_{EP}$], [$e2_{SP}$], and [$e2_{EP}$], are added at the start and end of both entities, i.e., $e1$ and $e2$. Therefore, the Sentence Input (SI) can be represented by Equation (1). $|e1|$ indicates the number of tokens in $e1$; the same applies to $e2$. The position of $e1$ and $e2$ can be switched according to the relation type.

$$SI = [[CLS] + t_{1}, \dots, t_{i} + [e1_{SP}] + e1 + [e1_{EP}] + t_{i + |e1| + 2}, \dots, t_{j} + [e2_{SP}] + e2 + [e2_{EP}] + t_{j + |e2| + 2}, \dots, t_{m - 1} + [SEP]]$$

(1)
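
As a rough illustration of how the Sentence Input of Equation (1) could be assembled, the sketch below wraps two entity spans with position tokens. It is a minimal sketch under stated assumptions, not the authors' preprocessing code: the function name `build_sentence_input`, the half-open `(start, end)` span convention, and the toy sentence are all hypothetical, and the spans are assumed not to overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive (assumed convention)


def build_sentence_input(tokens: List[str], e1_span: Span, e2_span: Span) -> List[str]:
    """Wrap both entity spans with position tokens and add [CLS] and [SEP]."""
    markers = [
        (e1_span, "[e1_SP]", "[e1_EP]"),
        (e2_span, "[e2_SP]", "[e2_EP]"),
    ]
    out = list(tokens)
    # Insert around the right-most entity first so that the indices of the
    # other entity are still valid; this handles e2 preceding e1 as well.
    for (start, end), start_tok, end_tok in sorted(
        markers, key=lambda item: item[0][0], reverse=True
    ):
        out[start:end] = [start_tok] + out[start:end] + [end_tok]
    return ["[CLS]"] + out + ["[SEP]"]


# Toy, already-tokenized sentence (hypothetical): "총리는 서울에 산다".
toy_tokens = ["총리", "##는", "서울", "##에", "산다"]
print(build_sentence_input(toy_tokens, e1_span=(0, 1), e2_span=(2, 3)))
```

Whether a postposition falls inside or outside the marked span depends on how the dataset annotates entity boundaries; the toy spans here mark only the root words.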

3.2. Relation Extraction with K-EPIC

We extend the methods of providing additional entity information in four ways, inspired by the work of Soares et al. [7]. Our four types of K-EPIC methods are illustrated in Figure 3. We put [$e1_{SP}$] and [$e1_{EP}$] at the start and end positions of the subject entity. Similarly, we place [$e2_{SP}$] and [$e2_{EP}$] at the start and end positions of the object entity, respectively. Because e1 does not always precede e2 in all sentences, our K-EPIC identifies the position of each entity and inserts the entity position tokens before and after the entity. Except for nonK-EPIC, we use this final preprocessed sentence, $SI$, as an input. Detailed information regarding each K-EPIC is explained below.

nonK-EPIC. The nonK-EPIC representation merely utilizes the [CLS] token of the input sentence in the final layer of the language model. Note that nonK-EPIC inputs do not include any entity position tokens. As the hidden states barely include information about entity position, the models struggle to find the precise position of the entities and their meanings explicitly. nonK-EPIC uses $h_{V}$ as the final hidden representation, which is identical to the hidden state vector of the [CLS] token.

K-EPIC_V. Whereas the nonK-EPIC input is not created from $SI$, K-EPIC_V, the vanilla method of our K-EPIC, processes the [CLS] token of $SI$. Similar to nonK-EPIC, K-EPIC_V uses $h_{V}$ as the final hidden representation. Comparing nonK-EPIC and K-EPIC_V therefore isolates the effect of the entity position tokens.

K-EPIC_S. K-EPIC_S exploits $h_{S}$, which represents the concatenation of the start position token of each entity, [$e1_{SP}$] and [$e2_{SP}$]. The concatenated vector is then fed into the linear projection layer to obtain the final relation. The representation enables the models to perceive entity information.

K-EPIC_E. In contrast to K-EPIC_S, K-EPIC_E uses only end position tokens, i.e., [$e1_{EP}$] and [$e2_{EP}$], in the same manner as K-EPIC_S. In K-EPIC_E, the final hidden representation, $h_{E}$, is the concatenation of these two end position tokens.

K-EPIC_SE. We also suggest a combined representation of K-EPIC_S and K-EPIC_E that utilizes all position tokens of the entities in $SI$, i.e., [$e1_{SP}$], [$e1_{EP}$], [$e2_{SP}$], and [$e2_{EP}$]. K-EPIC_SE uses $h_{SE}$, which is the concatenated result of all entity position tokens. We analyze the performance differences between these methods based on the number of entity position tokens.
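
The four representations differ only in which final-layer hidden states are selected and concatenated. The following sketch illustrates that selection on plain Python lists. It is an illustration under assumptions rather than the authors' implementation: the names `MARKERS` and `pool_representation` are hypothetical, the concatenation order for K-EPIC_SE follows the order in which the tokens are listed above, and the hidden states are dummy values instead of encoder outputs.

```python
from typing import Dict, List

# Which tokens each variant reads from the final layer. "V" covers both
# nonK-EPIC and K-EPIC_V, which differ only in the input sentence.
MARKERS: Dict[str, List[str]] = {
    "V": ["[CLS]"],
    "S": ["[e1_SP]", "[e2_SP]"],
    "E": ["[e1_EP]", "[e2_EP]"],
    "SE": ["[e1_SP]", "[e1_EP]", "[e2_SP]", "[e2_EP]"],
}


def pool_representation(
    hidden: List[List[float]], tokens: List[str], method: str
) -> List[float]:
    """Concatenate the hidden states found at the marker positions of `method`."""
    representation: List[float] = []
    for marker in MARKERS[method]:
        representation.extend(hidden[tokens.index(marker)])
    return representation


tokens = ["[CLS]", "[e1_SP]", "총리", "[e1_EP]", "##는",
          "[e2_SP]", "서울", "[e2_EP]", "##에", "산다", "[SEP]"]
# Dummy 2-dimensional hidden states: position i is represented by [i, i].
hidden = [[float(i), float(i)] for i in range(len(tokens))]

print(pool_representation(hidden, tokens, "S"))   # [1.0, 1.0, 5.0, 5.0]
print(pool_representation(hidden, tokens, "SE"))  # eight values, two per marker
```

With a hidden size of d, the representation has size d for V, 2d for S and E, and 4d for SE, so the input size of the classification layer changes with the variant.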

In summary, the input of nonK-EPIC consists of only the original sentence with [CLS] and [SEP] tokens. SI, which includes entity position tokens, is applied to K-EPIC_V, K-EPIC_S, K-EPIC_E, and K-EPIC_SE. The input embedding, segment embedding, and position embedding of the SI are used to create the representation. The final representation, h, is then fed into the linear layer and projected to the size of the total number of relations. Consequently, the softmax result denotes the final probability for predicting the relation, as shown in Equation (2):

$$Pr(r \mid SI) = \mathrm{softmax}(W^{T} \cdot h + b)$$

(2)

where W and b are learnable parameters.
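
Equation (2) can be worked through numerically with a small dependency-free sketch. The weights below are arbitrary toy values rather than trained parameters, and the helper names `softmax` and `relation_probabilities` are hypothetical; W is assumed to have one row per dimension of h and one column per relation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def relation_probabilities(
    h: List[float], W: List[List[float]], b: List[float]
) -> List[float]:
    """Compute softmax(W^T h + b), where W has shape (len(h), number of relations)."""
    logits = [
        sum(W[i][r] * h[i] for i in range(len(h))) + b[r]
        for r in range(len(b))
    ]
    return softmax(logits)


# Toy example: a 2-dimensional representation and 3 candidate relations.
h = [1.0, 2.0]
W = [[0.0, 1.0, 0.0],
     [0.0, 1.0, 0.5]]
b = [0.0, 0.0, 0.0]
probs = relation_probabilities(h, W, b)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # 1
```

Here the logits are [0, 3, 1], so the second relation (index 1) receives the highest probability.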

4. Experiments

4.1. Datasets

We demonstrate our experiments with our K-EPIC method on two existing RE datasets. The statistics of each dataset can be found in Table 3.

4.1.1. BERT-Ko-RE Dataset

BERT-Ko-RE [10] is a Korean RE dataset annotated in two different ways. One is a crowdsourcing dataset created in compliance with the Gated Instruction (GI) protocol [26] and the other is made with a Distant-Supervision (DS) approach [27]. The crowdsourcing dataset only includes the data tagged with high agreement among the workers, and it guarantees high-quality annotations, unlike the DS data, which contain a high rate of noise. The dataset contains pairs of a sentence and its corresponding relation label. Each relation is defined in an ontology schema with short English labels, such as country, knownFor, and part, which are grounded in everyday domains. As described in Table 3, the train and test sets have 20,603 and 1838 pairs, respectively, with 49 relations.

4.1.2. KLUE-RE Dataset

The KLUE-RE dataset [11] is also a hand-crafted Korean RE dataset based on the Korean knowledge base (https://aihub.or.kr/aidata/84 (accessed on 1 March 2021)), Wikipedia, and NamuWiki. As the official test set is not yet disclosed, we use the development set as the test set. A single data example includes an original sentence, a relation label, the data source, and information indicating the subject and the object. The labels cover common relations related to daily life, such as member_of, place_of_birth, and parents. The train and test sets have 32,470 and 7765 pairs, respectively, with 30 relations, as presented in Table 3.

4.2. Experimental Setting

We perform our experiments on the BERT-Ko-RE [10] and KLUE-RE [11] datasets. Using the proposed K-EPIC, we compare the capabilities of different pre-trained language models for RE tasks with different entity position tokens to demonstrate their enhanced performance. To ensure high data quality, we only use the hand-crafted datasets for the experiments. Our model is trained on a single RTX 8000 GPU; the learning rate and batch size are set to 2e-5 and 2, respectively, with 12 attention heads and 10 epochs. Moreover, we use the same parameters for all six models and use cross-entropy as the loss function.
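
For reference, the reported settings and the cross-entropy objective can be written down as a short sketch. The dictionary keys and function names are illustrative assumptions, not taken from the authors' code; only the numeric values come from the text above, and the probabilities in the example are made up.

```python
import math
from typing import List

# Values as reported in the text; the key names are hypothetical.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "batch_size": 2,
    "epochs": 10,
    "attention_heads": 12,
}


def cross_entropy(probs: List[float], gold_index: int) -> float:
    """Negative log-probability assigned to the gold relation."""
    return -math.log(probs[gold_index])


def batch_loss(batch_probs: List[List[float]], gold_indices: List[int]) -> float:
    """Mean cross-entropy over a batch of predicted relation distributions."""
    losses = [cross_entropy(p, g) for p, g in zip(batch_probs, gold_indices)]
    return sum(losses) / len(losses)


# One batch of two examples (matching the batch size of 2), three relations each.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
print(batch_loss(batch_probs, gold_indices=[0, 2]))
```

The loss approaches zero as the probability of the gold relation approaches one, which is the quantity minimized during fine-tuning.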

4.3. Results and Analysis

We utilize two evaluation metrics, micro-F1 and weighted-F1. Most existing RE studies [3,28] use micro-F1 as an evaluation metric, since the datasets are imbalanced across classes. Furthermore, we also employ weighted-F1, which weights the per-class F1 by the proportion of data in each class. In this section, we present three analyses with respect to the types of K-EPIC methods, the language models, and the tokenizers.
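
The two metrics can be illustrated with a small, dependency-free sketch. It assumes single-label predictions and computes both scores over all classes; benchmark-specific conventions (for example, excluding a particular class from micro-F1) are not modeled. The function name `f1_scores` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def f1_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro-F1, weighted-F1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t: int, p: int, n: int) -> float:
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    # Micro-F1 pools the counts of all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted-F1 averages per-class F1, weighted by each class's support.
    support = Counter(gold)
    weighted = sum(
        support[c] * f1(tp[c], fp[c], fn[c]) for c in support
    ) / len(gold)
    return micro, weighted


gold = ["member_of", "member_of", "member_of", "parents"]
pred = ["member_of", "member_of", "parents", "parents"]
micro, weighted = f1_scores(gold, pred)
print(round(micro, 4), round(weighted, 4))  # 0.75 0.7667
```

In this toy case the per-class F1 scores are 0.8 (member_of, support 3) and about 0.667 (parents, support 1), so the weighted score leans toward the more frequent class.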

4.3.1. Comparison on the K-EPIC Method

The performance of the proposed K-EPIC on the BERT-Ko-RE [10] and KLUE-RE [11] datasets is shown in Table 4 and Table 5, respectively. First, the proposed K-EPIC_V method demonstrates a significant increase of 35.97%p for weighted-F1, on average, in comparison to nonK-EPIC. The difference illustrates that the entity position tokens enhance the RE performance. We also demonstrate that K-EPIC_S achieves the best performance among all methods in all metrics for mBERT, KLUE-BERT, KoBERT, and KorBERT. In the case of HanBERT, K-EPIC_S obtains the highest micro-F1, and K-EPIC_SE presents the highest weighted-F1, with 73.72% and 78.62%, respectively. KoELECTRA, which has a different model architecture, achieves its highest scores with K-EPIC_V.

Table 5 presents the results on the KLUE-RE dataset. Similar to the previous experimental results, the performance of our K-EPIC_V increases by 32.44%p for weighted-F1, on average over all language models, in comparison to nonK-EPIC. Similarly, HanBERT, KLUE-BERT, and KoBERT show the best performance when they utilize K-EPIC_S. The rest of the language models perform better when they exploit K-EPIC_SE.

From our comparative analysis, we find that the inclusion of the start position token has a positive impact on the final performance. We assume that the closeness to the entity is related to the result. Since a Korean postposition is usually located next to the root word, the start position token is closer to the entity in the input sentence [29]. As a result, K-EPIC_S mostly achieves the best performance among the five Korean pre-trained language models, whereas the performance of K-EPIC_SE may have been hindered by the end position token obtaining a representation dissimilar to the entity. These results show that our K-EPIC_S method makes efficient predictions with the entity position information.

4.3.2. Comparison on the Korean Language Model

To effectively compare the language models with K-EPIC_S and K-EPIC_SE, which include the start position token, Figure 4 illustrates the performance on each dataset. As shown in the figure, KLUE-BERT outperforms the other language models on both datasets. Moreover, HanBERT performs only slightly below KLUE-BERT, implying that the volume of pre-training data has a primary impact on downstream tasks.

For the BERT-Ko-RE dataset, even though HanBERT is trained on a larger corpus than any other language model, its slightly lower performance in RE tasks can be attributed to the type of data it is trained on. Since patent documents usually include legal terminology for the protection of technologies [30], we assume that they have little impact on RE tasks, which usually contain everyday expressions.

For the KLUE-RE dataset, KLUE-BERT exhibits the best performance owing to the relevance of its pre-training data to the RE dataset. In addition, Korean-based language models such as KLUE-BERT exhibit comparable performance despite being trained on less data than mBERT. Similar to the results on the BERT-Ko-RE dataset, HanBERT presents the second-highest score for micro-F1 because of its volume of pre-training data.

4.3.3. Comparison on Korean Tokenizers

Because the Korean language is significantly different from English, the impact of tokenizers should be examined as well. Since KorBERT provides two different tokenizers for the same model architecture, we compare them under our K-EPIC method. One is the KorBERT model with the word-level tokenizer and the other is that with the morpheme-level tokenizer. We conduct the experiments on both the BERT-Ko-RE and KLUE-RE datasets with the K-EPIC method. As indicated in Table 6, the model with the morpheme-level tokenizer achieves a significantly higher score than that with the word-level tokenizer on both datasets. The results can be attributed to the linguistic property of the Korean language, which is an agglutinative language where words may contain different postpositions according to the role of the word [31]. Along with these results, K-EPIC_SE shows the best performance in comparison to the other K-EPIC methods and nonK-EPIC in most cases.

5. Conclusions

In this paper, we propose K-EPIC, four different methods of representing entities in Korean Relation Extraction. To consider Korean linguistic properties as well as entity representation, we employ five Korean pre-trained language models and achieve enhanced performance compared with mBERT. The experimental results demonstrate that, by using entity position tokens, the capability of pre-trained language models to understand the entities improves significantly. We evaluate the methods with each language model and find that K-EPIC_S shows the best performance in most language models. Finally, we present experimental results comparing different tokenizers, indicating that morpheme-level tokenization improves performance. In conclusion, our work suggests that the performance of a model in Korean RE highly depends on how well it understands the entities in sentences while preserving Korean linguistic properties.

Author Contributions

Conceptualization, Y.H.; methodology, Y.H.; software, S.S.; validation, S.S.; formal analysis, Y.H. and J.L.; investigation, M.S.; writing—original draft preparation/review and editing, Y.H., S.S., M.S. and J.L.; visualization, M.S. and J.L.; funding acquisition/project administration/supervision, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425). Additionally, it was also supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: “https://github.com/machinereading/bert-ko-re (accessed on 1 March 2021)” and “https://github.com/KLUE-benchmark/KLUE (accessed on 1 March 2021)”.

Conflicts of Interest

The authors declare no conflict of interest.

References

Pawar, S.; Palshikar, G.K.; Bhattacharyya, P. Relation extraction: A survey. arXiv 2017, arXiv:1712.05191. [Google Scholar]
Smirnova, A.; Cudré-Mauroux, P. Relation extraction using distant supervision: A survey. ACM Comput. Surv. (CSUR) 2018, 51, 1–35. [Google Scholar] [CrossRef]
Geng, Z.; Chen, G.; Han, Y.; Lu, G.; Li, F. Semantic relation extraction using sequential and tree-structured LSTM with attention. Inf. Sci. 2020, 509, 183–192. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). pp. 4171–4186. [Google Scholar]
Yamada, I.; Asai, A.; Shindo, H.; Takeda, H.; Matsumoto, Y. Luke: Deep contextualized entity representations with entity-aware self-attention. arXiv 2020, arXiv:2010.01057. [Google Scholar]
Zhou, W.; Chen, M. An Improved Baseline for Sentence-level Relation Extraction. arXiv 2021, arXiv:2102.01373. [Google Scholar]
Soares, L.B.; Fitzgerald, N.; Ling, J.; Kwiatkowski, T. Matching the Blanks: Distributional Similarity for Relation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 29–31 July 2019; pp. 2895–2905. [Google Scholar]
Hendrickx, I.; Kim, S.N.; Kozareva, Z.; Nakov, P.; Séaghdha, D.O.; Padó, S.; Pennacchiotti, M.; Romano, L.; Szpakowicz, S. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv 2019, arXiv:1911.10422. [Google Scholar]
Zhang, D.; Wang, D. Relation classification via recurrent neural network. arXiv 2015, arXiv:1508.01006. [Google Scholar]
Nam, S.; Lee, M.; Kim, D.; Han, K.; Kim, K.; Yoon, S.; Kim, E.K.; Choi, K.S. Effective Crowdsourcing of Multiple Tasks for Comprehensive Knowledge Extraction. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 212–219. [Google Scholar]
Park, S.; Moon, J.; Kim, S.; Cho, W.I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; et al. KLUE: Korean Language Understanding Evaluation. arXiv 2021, arXiv:2105.09680. [Google Scholar]
Park, J.J.; Myaeng, S.H. A Method for Establishing Korean Multi-Word Concept Boundary Harnessing Dictionaries and Sentence Segmentation for Constructing Concept Graph; Korean Institute of Information Scientists and Engineers: Seoul, Korea, 2017; Volume 2017, pp. 651–653. [Google Scholar]
Shin, M.K. The resetting of the head direction parameter. Foreign Lang. Educ. Res. 2015, 18, 17–35. [Google Scholar]
Park, J. KoELECTRA: Pretrained ELECTRA Model for Korean. 2020. Available online: https://github.com/monologg/KoELECTRA (accessed on 1 March 2021).
Kambhatla, N. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, Barcelona, Spain, 21–26 July 2004; pp. 178–181.
Zhou, G.; Su, J.; Zhang, J.; Zhang, M. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Stroudsburg, PA, USA, 25–30 June 2005; pp. 427–434.
Swampillai, K.; Stevenson, M. Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, Varna, Bulgaria, 2–4 September 2011; pp. 25–32.
Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; Sun, M. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 2124–2133.
Kim, B.; Lee, J.S. Extracting spatial entities and relations in Korean text. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2389–2396.
Kim, K.; Hur, Y.; Kim, G.; Lim, H. GREG: A global level relation extraction with knowledge graph embedding. Appl. Sci. 2020, 10, 1181.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555.
Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725.
Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv 2018, arXiv:1808.06226.
Lee, S.; Jang, H.; Baik, Y.; Park, S.; Shin, H. KR-BERT: A small-scale Korean-specific language model. arXiv 2020, arXiv:2008.03979.
Liu, A.; Soderland, S.; Bragg, J.; Lin, C.H.; Ling, X.; Weld, D.S. Effective crowd annotation for relation extraction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 897–906.
Mintz, M.; Bills, S.; Snow, R.; Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 1003–1011.
Wang, H.; Chen, M.; Zhang, H.; Roth, D. Joint Constrained Learning for Event-Event Relation Extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 696–706.
Wang, B.; Shang, L.; Lioma, C.; Jiang, X.; Yang, H.; Liu, Q.; Simonsen, J.G. On position embeddings in BERT. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
Jeong, S. Formulierungsmuster in Patentschriften im Deutschen, Englischen und Koreanischen. Ger. Lit. Soc. 2016, 26, 231–249.
Yu, S.; Kulkarni, N.; Lee, H.; Kim, J. Syllable-level Neural Language Model for Agglutinative Language. arXiv 2017, arXiv:1708.05515.

Figure 1. Example of Korean RE. The English translation of the Korean sentence is “KBS 1 Radio is a radio channel operated and broadcasted by the Korea Broadcasting Corporation (Entity 1) in South Korea (Entity 2)”.

Figure 2. Overview of our K-EPIC process utilizing RE.

Figure 3. Four different types of final representations. In the example sentence, “Korea Broadcasting Corporation” is the subject entity and “South Korea” is the object entity. K-EPIC $_{V}$ only utilizes the [CLS] token, whereas K-EPIC $_{S}$ and K-EPIC $_{E}$ utilize the start position token and end position token, respectively. K-EPIC $_{S E}$ exploits all entity position tokens, which are the start position tokens and end position tokens of both entities.
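The four variants in Figure 3 differ only in which token positions feed the final representation. The following is a minimal sketch of that selection step, not the authors' implementation. The marker spellings (`[e1_SP]`, `[e1_EP]`, `[e2_SP]`, `[e2_EP]`), the function name `select_positions`, and the surrounding `[CLS]`/`[SEP]` tokens are assumptions made for illustration. The caption does not state whether the entity variants also include the [CLS] vector or how the selected vectors are combined, so the sketch returns only the positions named in the caption.

```python
# Hypothetical marker spellings; the paper writes them as e1_SP, e1_EP, e2_SP, e2_EP.
VARIANT_TOKENS = {
    "V": ["[CLS]"],                                      # K-EPIC_V: [CLS] only
    "S": ["[e1_SP]", "[e2_SP]"],                         # K-EPIC_S: start position tokens
    "E": ["[e1_EP]", "[e2_EP]"],                         # K-EPIC_E: end position tokens
    "SE": ["[e1_SP]", "[e1_EP]", "[e2_SP]", "[e2_EP]"],  # K-EPIC_SE: all four
}


def select_positions(tokens, variant):
    """Return the indices of the tokens that a given variant reads from.

    Assumes every marker occurs exactly once in `tokens`.
    """
    return [tokens.index(marker) for marker in VARIANT_TOKENS[variant]]


# KLUE-BERT tokenization of the Table 2 sentence, with [CLS]/[SEP] added
# following the usual BERT input convention (an assumption here).
tokens = [
    "[CLS]", "[e1_SP]", "총리", "##는", "[e1_EP]", "내각", "##의", "수장",
    "##이", "##자", "[e2_SP]", "정부", "##의", "[e2_EP]", "수장", "##이다",
    ".", "[SEP]",
]

for variant in ("V", "S", "E", "SE"):
    print(variant, select_positions(tokens, variant))
```

In a full model, the hidden states at these indices would be gathered from the encoder output and passed to the relation classifier.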

Figure 4. (a) Performance of each language model on the BERT-Ko-RE dataset. The green and pink bars indicate the performance of K-EPIC $_{S}$ and K-EPIC $_{S E}$, respectively; (b) performance of each language model on the KLUE-RE dataset. The green and pink bars indicate the performance of K-EPIC $_{S}$ and K-EPIC $_{S E}$, respectively.

Table 1. Comparison of different language models.

Model	Model Size	Types of Pre-Trained Data	Pre-Trained Data Size	Tokenizers	Vocab Size
mBERT	178 M	Wikipedia pages in 104 languages	2.5 B words	WordPiece	119,547
HanBERT	128 M	General Document, Patent Document	70 GB (11.3 B morphemes)	Moran	54,000
KLUE-BERT	111 M	Modu corpus, CC-100-Kor, Namu Wiki, News Crawl, Petition	63 GB	Morpheme-based subword tokenizer	32,000
KoBERT	92 M	Korean Wikipedia	5 M sentences, 54 M words	SentencePiece	8002
KorBERT	110 M	Newspaper, Encyclopedia	23 GB (4.7 B morphemes)	Morpheme/WordPiece	30,349 (morphemes), 30,797 (wordpiece)
KoELECTRA	112 M	Korean Wikipedia, Namu Wiki, Newspaper, Messages, Web, etc.	34 GB	WordPiece	35,000

Table 2. Comparison of the tokenization results from each model. #, ∼, and _ denote the boundary symbols of the respective tokenizers.

Korean sentence	[$e1_{SP}$] 총리는 [$e1_{EP}$] 내각의 수장이자 [$e2_{SP}$] 정부의 [$e2_{EP}$] 수장이다.
English translation	[$e1_{SP}$] The Prime Minister [$e1_{EP}$] is the Head of the Cabinet and the Head of [$e2_{SP}$] Government [$e2_{EP}$].
BERT	’[$e1_{SP}$]’, ’The’, ’prime’, ’minister’, ’[$e1_{EP}$]’, ’is’, ’the’, ’head’, ’of’, ’the’, ’cabinet’, ’and’, ’the’, ’head’, ’of’, ’[$e2_{SP}$]’, ’government’, ’[$e2_{EP}$]’, ’.’
mBERT	’[$e1_{SP}$]’, ’총’, ’##리는’, ’[$e1_{EP}$]’, ’내’, ’##각’, ’##의’, ’수’, ’##장이’, ’##자’, ’[$e2_{SP}$]’, ’정’, ’##부의’, ’[$e2_{EP}$]’, ’수’, ’##장이’, ’##다’, ’.’
HanBERT	’[$e1_{SP}$]’, ’총리’, ’∼∼는’, ’[$e1_{EP}$]’, ’내각’, ’∼∼의’, ’수장’, ’##이자’, ’[$e2_{SP}$]’, ’정부’, ’∼∼의’, ’[$e2_{EP}$]’, ’수장’, ’##이다’, ’.’
KLUE-BERT	’[$e1_{SP}$]’, ’총리’, ’##는’, ’[$e1_{EP}$]’, ’내각’, ’##의’, ’수장’, ’##이’, ’##자’, ’[$e2_{SP}$]’, ’정부’, ’##의’, ’[$e2_{EP}$]’, ’수장’, ’##이다’, ’.’
KoBERT	’[$e1_{SP}$]’, ’_총리’, ’는’, ’[$e1_{EP}$]’, ’_내’, ’각’, ’의’, ’_수’, ’장이’, ’자’, ’[$e2_{SP}$]’, ’_정부의’, ’[$e2_{EP}$]’, ’_수’, ’장’, ’이다’, ’.’
KorBERT	’[$e1_{SP}$]’, ’총리/NNG’, ’는/JX’, ’[$e1_{EP}$]’, ’내각/NNG’, ’의/JKG’, ’수장/NNG’, ’이/VCP’, ’자/EC’, ’[$e2_{SP}$]’, ’정부/NNG’, ’의/JKG’, ’[$e2_{EP}$]’, ’수장/NNG’, ’이/VCP’, ’다/EF’, ’./SF’
KoELECTRA	’[$e1_{SP}$]’, ’총리’, ’##는’, ’[$e1_{EP}$]’, ’내각’, ’##의’, ’수장’, ’##이’, ’##자’, ’[$e2_{SP}$]’, ’정부’, ’##의’, ’[$e2_{EP}$]’, ’수장’, ’##이다’, ’.’
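Every row of Table 2 starts from a sentence in which the two entities are already wrapped in entity position tokens. The sketch below shows one way such an input could be produced from character offsets before tokenization. The function name `mark_entities` and the plain-text marker spellings are hypothetical, and the sketch assumes that the two entity spans do not overlap.

```python
def mark_entities(sentence, subj_span, obj_span):
    """Wrap the subject and object spans in entity position tokens.

    Spans are (start, end) character offsets with an exclusive end.
    Assumes the two spans do not overlap.
    """
    inserts = [
        (subj_span[0], "[e1_SP] "),
        (subj_span[1], " [e1_EP]"),
        (obj_span[0], "[e2_SP] "),
        (obj_span[1], " [e2_EP]"),
    ]
    # Insert from right to left so that earlier offsets remain valid.
    for pos, marker in sorted(inserts, key=lambda item: item[0], reverse=True):
        sentence = sentence[:pos] + marker + sentence[pos:]
    return sentence


sentence = "총리는 내각의 수장이자 정부의 수장이다."
subj_start = sentence.find("총리는")
obj_start = sentence.find("정부의")
marked = mark_entities(
    sentence,
    (subj_start, subj_start + len("총리는")),
    (obj_start, obj_start + len("정부의")),
)
print(marked)
```

For the markers to survive tokenization as single units, they would normally have to be registered as special tokens in each model's tokenizer; how this is done differs between the tokenizers compared in the table.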

Table 3. Statistics of RE datasets in Korean.

	BERT-Ko-RE		KLUE-RE
	Train	Test	Train	Test
Number of Sentences	20,603	1838	32,470	7765
Number of Relations	49	29	30	30

Table 4. Experimental results on the BERT-Ko-RE dataset for six language models with the K-EPIC methods. Avg.%p is the weighted-F1 score averaged over all language models minus the corresponding nonK-EPIC average (the best-performing results are additionally marked in bold).

Data	BERT-Ko-RE dataset
Model	mBERT		HanBERT		KLUE-BERT		KoBERT		KorBERT		KoELECTRA
Metric (F1)	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Avg.%p
nonK-EPIC	39.39	39.41	39.55	39.09	39.39	37.95	39.45	38.32	35.43	34.26	39.55	38.27	+0.00
K-EPIC $_{V}$	70.40	74.91	69.15	74.97	71.76	76.24	63.06	67.75	68.88	73.84	70.24	75.41	+35.97
K-EPIC $_{S}$	73.56	78.88	73.72	78.50	77.53	81.43	73.34	78.07	72.13	77.14	68.61	75.21	+40.32
K-EPIC $_{E}$	69.97	75.65	70.78	75.27	72.14	76.49	66.21	71.90	68.48	73.27	67.41	72.34	+36.27
K-EPIC $_{S E}$	70.29	75.80	73.07	78.62	74.86	80.11	69.53	75.18	67.94	73.90	68.23	73.07	+38.23
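The Avg.%p column can be checked directly from the weighted-F1 entries of Table 4. The short sketch below copies those entries and recomputes the column as the mean weighted-F1 of each method minus the mean of the nonK-EPIC row. The dictionary keys and the helper name `avg_gain` are illustrative only.

```python
# Weighted-F1 scores copied from Table 4 (BERT-Ko-RE), in the order
# mBERT, HanBERT, KLUE-BERT, KoBERT, KorBERT, KoELECTRA.
weighted_f1 = {
    "nonK-EPIC": [39.41, 39.09, 37.95, 38.32, 34.26, 38.27],
    "K-EPIC_V": [74.91, 74.97, 76.24, 67.75, 73.84, 75.41],
    "K-EPIC_S": [78.88, 78.50, 81.43, 78.07, 77.14, 75.21],
    "K-EPIC_E": [75.65, 75.27, 76.49, 71.90, 73.27, 72.34],
    "K-EPIC_SE": [75.80, 78.62, 80.11, 75.18, 73.90, 73.07],
}


def mean(values):
    return sum(values) / len(values)


def avg_gain(method, scores, baseline="nonK-EPIC"):
    """Average weighted-F1 of `method` minus that of the baseline, in %p."""
    return round(mean(scores[method]) - mean(scores[baseline]), 2)


for method in weighted_f1:
    print(f"{method}: {avg_gain(method, weighted_f1):+.2f}")
```

Run on the values above, this reproduces the Avg.%p entries of the table (+35.97, +40.32, +36.27, and +38.23 for the four K-EPIC variants).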

Table 5. Experimental results on the KLUE-RE dataset for six language models with the K-EPIC methods. Avg.%p is the weighted-F1 score averaged over all language models minus the corresponding nonK-EPIC average (the best-performing results are additionally marked in bold).

Data	KLUE-RE Dataset
Model	mBERT		HanBERT		KLUE-BERT		KoBERT		KorBERT		KoELECTRA
Metric (F1)	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted	Avg.%p
nonK-EPIC	20.91	20.34	23.64	22.84	23.54	23.97	22.84	21.81	21.62	21.94	23.03	22.31	+0.00
K-EPIC $_{V}$	53.43	52.66	56.23	55.44	60.57	59.29	54.39	53.33	56.00	57.10	50.89	50.03	+32.44
K-EPIC $_{S}$	54.27	53.71	59.43	58.92	61.23	60.77	56.57	56.05	57.79	58.38	57.80	57.24	+35.31
K-EPIC $_{E}$	55.18	53.70	57.82	58.06	60.75	59.69	54.75	54.77	57.46	58.26	58.02	57.82	+34.85
K-EPIC $_{S E}$	55.78	54.59	59.01	57.79	60.31	59.79	55.82	54.88	58.61	58.54	59.29	58.51	+35.15

Table 6. Experimental results for the two KorBERT tokenizers. Underlined text denotes the highest score of KorBERT with the morpheme-based tokenizer, and bold text indicates the highest score among the K-EPIC methods.

Data	BERT-Ko-RE Dataset				KLUE-RE Dataset
Model	KorBERT w/Wordpiece		KorBERT w/Morpheme		KorBERT w/Wordpiece		KorBERT w/Morpheme
Metric (F1)	Micro	Weighted	Micro	Weighted	Micro	Weighted	Micro	Weighted
nonK-EPIC	32.26	33.89	35.43	34.26	14.68	14.14	21.62	21.94
K-EPIC $_{V}$	58.32	61.86	68.88	73.84	31.67	30.46	56.00	57.10
K-EPIC $_{S}$	60.45	64.43	72.13	77.14	32.97	31.97	57.79	58.38
K-EPIC $_{E}$	57.02	59.60	68.48	73.27	33.82	32.38	57.46	58.26
K-EPIC $_{S E}$	61.59	65.69	67.94	73.90	34.40	32.50	58.61	58.54

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
