Visual Exploration of Stopword Probabilities in Topic Models

Research output: Working paper › Preprint

Abstract

Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration; when neglected, it interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, thus reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood, together with an interactive visualization system to support the analysis of these stopwords. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results of our work show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
Original language: English
Publisher: arXiv
Publication status: Published - 17 Jan 2025

Keywords

  • Visualisation
  • Topic Modelling
  • HCI
  • User Study
  • Stopwords
  • Machine learning
