Subunits Inference and Lexicon Development Based on Pairwise Comparison of Utterances and Signs

Sandrine Tornay and Mathew Magimai.-Doss

Abstract

Communication languages convey information through a set of symbols or units. Typically, this unit is the word. When developing language technologies, because words in a language do not have the same prior probability, there may not be sufficient training data to model each word. Furthermore, the training data may not cover all possible words in the language. Because of these data sparsity and word unit coverage issues, language technologies model subword units, or subunits, which are based on prior linguistic knowledge. For instance, the development of speech technologies such as automatic speech recognition systems presumes that a phonetic dictionary, or at least a writing system, exists for the target language. Such knowledge is not available for all languages in the world. In that direction, this article develops a hidden Markov model-based abstract methodology to extract subword units given only pairwise comparisons between utterances (or realizations of words in the mode of communication), i.e., whether two utterances correspond to the same word or not. We validate the proposed methodology through investigations on spoken language and sign language. In the case of spoken language, we demonstrate that the proposed methodology can lead to the discovery of a phone set and the development of a phonetic dictionary. In the case of sign language, we demonstrate how hand movement information can be effectively modeled for sign language processing and synthesized back to gain insight into the derived subunits.
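
As a minimal illustration of the only supervision the abstract assumes (whether two utterances correspond to the same word or not), the sketch below builds that same/different indicator for every pair in a toy collection. The function name `pairwise_same_word` and the toy labels are hypothetical and do not come from the article; the sketch shows the pairwise signal only, not the hidden Markov model-based subunit extraction itself.

```python
from itertools import combinations


def pairwise_same_word(labels):
    """Map each index pair (i, j) with i < j to True when
    utterances i and j are realizations of the same word."""
    return {
        (i, j): labels[i] == labels[j]
        for i, j in combinations(range(len(labels)), 2)
    }


# Hypothetical toy data: the word identity behind five utterances.
labels = ["yes", "no", "yes", "maybe", "no"]

pairs = pairwise_same_word(labels)

# Keep only the pairs flagged as "same word".
same = sorted(pair for pair, is_same in pairs.items() if is_same)
print(same)
```

Only the boolean per pair is used here; no phonetic dictionary or writing system is consulted, which matches the low-resource setting the abstract describes.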

Keywords

subword units - phone set - pronunciation lexicon - hidden Markov model - under-resourced - speech processing - sign language processing

PAGES

pp. 0 - 0

ISSUE

Volume: 10 Issue: 10 (2019)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY

SIMILAR JOURNALS

Information
Informatics
Applied Sciences

DOI

https://doi.org/10.3390/info10100298

Similar articles

Automatic Translation between Mixtec to Spanish Languages Using Neural Networks

Hermilo Santiago-Benito, Diana-Margarita Córdova-Esparza, Noé-Alejandro Castro-Sánchez, Teresa García-Ramirez, Julio-Alejandro Romero-González and Juan Terven

This paper introduces a novel method for collecting and translating texts from the Mixtec to the Spanish language. The method comprises four primary steps. First, we collected a Mixtec-Spanish corpus that includes 4568 sentences from educational and reli...

Journal: Applied Sciences

CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement

Shiqian Guo, Yansun Huang, Baohua Huang, Linda Yang and Cong Zhou

This paper proposed a method for improving the XLNet model to address the shortcomings of segmentation algorithms for processing the Chinese language, such as long sub-word lengths, long word lists and incomplete word list coverage. To address these issues, w...

Journal: Applied Sciences

Features of Computer Morphological Analysis and Synthesis of verbs of the Tajik language

Navruz Madibragimov, pp. 79 - 86

Today, computational linguistics of the Tajik language is at the origin of its development. In order to develop this area, the author of this article is developing a project for the formalization of inflections of the Tajik language for computer morpholo...

Journal: International Journal of Open Information Technologies

Machine Translation of Electrical Terminology Constraints

Zepeng Wang, Yuan Chen and Juwei Zhang

In practical applications, the accuracy of domain terminology translation is an important criterion for the performance evaluation of domain machine translation models. Aiming at the problem of phrase mismatch and improper translation caused by word-by-w...

Journal: Information

Prompt-Based Word-Level Information Injection BERT for Chinese Named Entity Recognition

Qiang He, Guowei Chen, Wenchao Song and Pengzhou Zhang

Named entity recognition (NER) is a subfield of natural language processing (NLP) that identifies and classifies entities from plain text, such as people, organizations, locations, and other types. NER is a fundamental task in information extraction, inf...

Journal: Applied Sciences
