Transformer-Encoder Generated Context-Aware Embeddings for Spell Correction

Noufal Samsudin*, Hani Ragab Hassan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we propose a novel approach to context-aware spell correction in text documents. We present a deep learning model that learns a context-aware, character-level mapping of words to a compact embedding space in which a word and its spelling variants lie close to each other under the Euclidean distance. Once this mapping has been computed for every word in the dataset’s vocabulary, misspelt words can be identified and corrected by comparing the distances between their embeddings and those of the correctly spelt words. Because the embeddings capture the context of each word, the system can also identify correctly spelt words that are used out of context (e.g., their/there, your/you’re). The Euclidean distance between our word embeddings can thus be regarded as a context-aware string similarity metric. To achieve this, we employ a transformer-encoder model that takes as input the characters of a word and of its context and outputs the embedding. The model is trained to minimize a triplet loss, which ensures that spelling variants of a word are embedded close to that word and that unrelated words are embedded farther away. We further improve training efficiency by using a hard triplet mining approach. Our approach was inspired by FaceNet [18], whose authors developed a similar approach for face recognition and clustering using embeddings generated by convolutional neural networks. Our experimental results show that the approach is effective in spell-check applications.
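The ideas in the abstract (a triplet loss, hard triplet mining, and correction by nearest embedding) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the margin value, the use of squared Euclidean distances in the loss (as in FaceNet), the choice of the single closest negative as the "hard" one, and the hand-written 2-D vectors that stand in for transformer-encoder outputs.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the negative is at least `margin` farther
    (in squared distance) from the anchor than the positive is."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def hardest_negative(anchor, negatives):
    """One simple form of hard mining (assumed here): pick the negative
    that currently lies closest to the anchor."""
    return min(negatives, key=lambda n: sq_dist(anchor, n))

def correct(query_embedding, vocab):
    """Return the vocabulary word whose embedding is nearest to the query."""
    return min(vocab, key=lambda w: math.sqrt(sq_dist(query_embedding, vocab[w])))

# Hypothetical toy embeddings; a real system would obtain these from the model.
vocab = {
    "there": [0.0, 0.0],
    "their": [3.0, 0.0],
    "apple": [0.0, 5.0],
}

# Pretend this is the embedding of the misspelling "thier" in a context
# that calls for the possessive.
thier_in_context = [2.6, 0.2]

print(correct(thier_in_context, vocab))  # nearest vocabulary entry: "their"

# Loss for (anchor="their", positive="thier") against the hardest negative.
anchor = vocab["their"]
negatives = [vocab["there"], vocab["apple"]]
hard = hardest_negative(anchor, negatives)
print(triplet_loss(anchor, thier_in_context, hard))
```

In this toy setting the loss is already zero, because the closest negative is much farther from the anchor than the misspelling is. Training would concentrate on triplets where that is not yet the case, which is the motivation for mining hard triplets.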

Original language: English
Title of host publication: Artificial Neural Networks in Pattern Recognition. ANNPR 2022
Editors: Neamat El Gayar, Edmondo Trentin, Mirco Ravanelli, Hazem Abbas
Publisher: Springer
Pages: 129-139
Number of pages: 11
ISBN (Electronic): 9783031206504
ISBN (Print): 9783031206498
Publication status: Published - 2023
Event: 10th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition 2022 - Dubai, United Arab Emirates
Duration: 24 Nov 2022 - 26 Nov 2022
https://annpr2022.com/

Publication series

Name: Lecture Notes in Computer Science
Volume: 13739
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition 2022
Abbreviated title: ANNPR 2022
Country/Territory: United Arab Emirates
City: Dubai
Period: 24/11/22 - 26/11/22

Keywords

  • Multi-headed attention
  • Spell correction
  • Transformer
  • Triplet loss

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
