Abstract
In this paper, we propose a novel approach to context-aware spell correction in text documents. We present a deep learning model that learns a context-aware, character-level mapping of words to a compact embedding space in which a word and its spelling variations lie close to each other under the Euclidean distance. Once this mapping has been computed for all words in the dataset’s vocabulary, misspelt words can be identified and corrected by comparing the distances between their embeddings and those of correctly spelt words. Because the embeddings capture the context of each word, our system can also identify correctly spelt words that are used out of context (e.g., their/there, your/you’re). The Euclidean distance between our word embeddings can thus be regarded as a context-aware string similarity metric. To achieve this, we employ a transformer-encoder model that takes character-level input of words and their context and outputs the embeddings. The model is trained to minimize a triplet loss, which ensures that spelling variants of a word are embedded close to that word and that unrelated words are embedded farther away. We further improve training efficiency by using a hard triplet mining approach. Our approach was inspired by FaceNet [18], whose authors developed a similar approach for face recognition and clustering using embeddings generated by convolutional neural networks. The results of our experiments show that our approach is effective in spell-check applications.
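The training objective and the distance-based lookup described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the margin of 0.2, the use of squared Euclidean distance (as in FaceNet), and the "batch-hard" mining variant are all assumptions made for illustration, and the transformer encoder that would produce the embeddings is omitted.

```python
import numpy as np


def pairwise_sq_dists(emb):
    """Squared Euclidean distance between every pair of rows in emb."""
    sq = np.sum(emb ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clip tiny negatives from rounding


def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """Triplet loss with batch-hard mining (an assumed mining variant).

    emb:    (n, d) array of embeddings, one row per word occurrence.
    labels: length-n ids; spelling variants of one word share an id.
    For each anchor, the farthest positive and the closest negative
    in the batch form the triplet.
    """
    labels = np.asarray(labels)
    d = pairwise_sq_dists(emb)
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)
    neg_mask = ~same
    hardest_pos = np.where(pos_mask, d, -np.inf).max(axis=1)
    hardest_neg = np.where(neg_mask, d, np.inf).min(axis=1)
    # Anchors lacking a positive or a negative are excluded from the mean.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return float(losses[valid].mean()) if valid.any() else 0.0


def nearest_word(query_emb, vocab_embs, vocab_words):
    """Return the vocabulary word whose embedding is closest to query_emb."""
    d = np.sum((vocab_embs - query_emb) ** 2, axis=1)
    return vocab_words[int(np.argmin(d))]


# Toy data: two words, each with one spelling variant, already well separated.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
labels = [0, 0, 1, 1]
loss_separated = batch_hard_triplet_loss(emb, labels)

# Collapsed embeddings: every distance is zero, so each anchor pays the margin.
loss_collapsed = batch_hard_triplet_loss(np.zeros((4, 2)), labels)

# Correction step: map a query embedding to its nearest vocabulary entry.
vocab_embs = np.array([[0.0, 0.0], [1.0, 0.0]])
suggestion = nearest_word(np.array([0.9, 0.1]), vocab_embs, ["their", "there"])
```

In this toy setup, the well-separated batch incurs no loss, while the collapsed batch is penalised by the full margin, which is the behaviour that pushes unrelated words apart during training. In a full system, `nearest_word` would be applied to the context-aware embedding of each word in the input text.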
Original language | English |
---|---|
Title of host publication | Artificial Neural Networks in Pattern Recognition. ANNPR 2022 |
Editors | Neamat El Gayar, Edmondo Trentin, Mirco Ravanelli, Hazem Abbas |
Publisher | Springer |
Pages | 129-139 |
Number of pages | 11 |
ISBN (Electronic) | 9783031206504 |
ISBN (Print) | 9783031206498 |
Publication status | Published - 2023 |
Event | 10th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition 2022, Dubai, United Arab Emirates. Duration: 24 Nov 2022 → 26 Nov 2022. https://annpr2022.com/ |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Volume | 13739 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 10th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition 2022 |
---|---|
Abbreviated title | ANNPR 2022 |
Country/Territory | United Arab Emirates |
City | Dubai |
Period | 24/11/22 → 26/11/22 |
Keywords
- Multi-headed attention
- Spell correction
- Transformer
- Triplet loss
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science