A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling

Ioannis Chalkiadakis, Gareth Peters, Michael John Chantler, Ioannis Konstas

Research output: Working paper / Preprint


In this research we study the fundamental structure of written natural language. We begin by constructing stochastic text embeddings that preserve the key features of natural language: time-dependence, semantics, and the laws of grammar and syntax. We present in detail how to clean and prepare raw text for processing, and then use a common infrastructure of $N$-ary relations to construct all our target embeddings: (1) a time-series representation of the well-known Bag-of-Words (BoW) model that captures the frequency characteristics of text processing units (e.g. words); (2) an embedding based on the context-free grammar formalism that produces a tree-valued time series indexed by the $n$-gram index; (3) an embedding based on the dependency grammar formalism that produces a graph-valued time series indexed by the $n$-gram index; and (4) a combination of BoW and syntax that yields an embedding of word co-occurrence frequencies restricted to particular syntactic structures.

From the observed $N$-ary time-series embeddings (1)-(4), we construct a sequence of statistical process summaries that we use to study the constructed text time-series embeddings with regard to long memory and its multifractal extension, stationarity, and behaviour at the extremes. We then compare the processes realised by each embedding to determine whether they are statistically different in a formal manner, using a specialised inference procedure that tests whether two stochastic representations contain common information, i.e. whether they were generated by the same underlying process.

Finally, we use the linguistic insights obtained from our analysis to extend Item-Response and log-linear models for contingency tables of word counts into a stochastic formulation that captures the specificities of text we have identified. Specifically, we employ a Multiple-Output Gaussian Process in the intensity of a Poisson regression model with log-linear intensity structure. The structure in the covariance of the Gaussian Process accounts for long memory and for the inter-dependence in word selection. The way we formulate the model allows us to measure the contribution of each of the embeddings we have constructed, and hence to draw conclusions on the representational power each contains.
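The modelling step described above can be summarised, under illustrative notation (the symbols below are assumptions, not taken from the paper), as a Poisson regression whose log-linear intensity carries a Multiple-Output Gaussian Process latent term:

```latex
\begin{align}
  y_{w,t} &\sim \mathrm{Poisson}(\lambda_{w,t}), \\
  \log \lambda_{w,t} &= \mathbf{x}_{w,t}^{\top}\boldsymbol{\beta} + f_w(t), \\
  (f_1,\dots,f_W) &\sim \mathrm{MOGP}(\mathbf{0}, K),
\end{align}
```

where $y_{w,t}$ is the count of word $w$ at time index $t$, $\mathbf{x}_{w,t}$ collects the embedding-derived covariates, and the cross-covariance $K$ is structured to encode both long memory along $t$ and the inter-dependence between words $w$.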
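As a rough illustration of embedding (1), the sketch below builds a vector-valued time series of word counts by sliding a fixed-width window over a token stream. The function name, windowing scheme, and vocabulary handling are assumptions for illustration, not the authors' implementation:

```python
from collections import Counter

def bow_time_series(tokens, vocab, window=50, step=50):
    """Slide a fixed-width window over the token stream and emit, for each
    window position, a vector of counts over the vocabulary. The result is
    a (num_windows x |vocab|) count matrix indexed by window position,
    i.e. a vector-valued time series of BoW frequencies."""
    series = []
    for start in range(0, len(tokens) - window + 1, step):
        counts = Counter(tokens[start:start + window])
        series.append([counts.get(w, 0) for w in vocab])
    return series

# Toy usage on a short token stream (already cleaned and tokenised).
tokens = "the cat sat on the mat the cat ran".split()
ts = bow_time_series(tokens, vocab=["the", "cat", "mat"], window=4, step=2)
# Each row of `ts` is the count vector for one window position.
```

The same sliding-index scaffolding carries over to embeddings (2)-(4), with the per-window count vector replaced by a parse tree, a dependency graph, or syntax-restricted co-occurrence counts respectively.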
Original language: English
Publication status: Published - 11 Feb 2021


Keywords:

  • natural language
  • text processing
  • long memory
  • persistence
  • multifractal time-series
  • Brownian bridge
  • Multiple-Output Gaussian Processes
  • item-response models
  • contingency tables


