TY - UNPB
T1 - A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling
AU - Chalkiadakis, Ioannis
AU - Peters, Gareth
AU - Chantler, Michael John
AU - Konstas, Ioannis
PY - 2021/2/11
Y1 - 2021/2/11
AB - In this research we study the fundamental structure of written natural language. We begin by constructing stochastic text embeddings that preserve the key features of natural language: the time-dependence, the semantics, and the laws of grammar and syntax. We present in detail how to clean and prepare the raw text for processing, and then use a common infrastructure of $N$-ary relations to construct all our target embeddings: (1) we construct a time-series representation of the well-known Bag-of-Words (BoW) model to capture the frequency characteristics of text processing units (e.g. words), (2) we utilise the context-free grammar formalism, which produces a tree-valued time-series indexed by the $n$-gram index, (3) we employ the dependency grammar formalism, which produces a graph-valued time-series indexed by the $n$-gram index, and (4) we combine BoW and syntax to construct an embedding of word co-occurrence frequencies restricted to particular syntactic structures. For the observed $N$-ary time-series embeddings from (1)-(4), we construct a sequence of statistical process summaries that we use to study the constructed text time-series embeddings with regard to long memory and its multifractal extension, stationarity, and behaviour at the extremes. We then formally compare the processes realised by each embedding, using a specialised inference procedure that tests whether two stochastic representations contain common information, i.e., whether they were generated by the same underlying process. We then use the linguistic insights obtained from our analysis to extend Item-Response and log-linear models for contingency tables of word counts into a stochastic formulation that captures the specific characteristics of text we have identified. Specifically, we employ a Multiple-Output Gaussian Process in the intensity of a Poisson regression model with a log-linear intensity structure. The structure in the covariance of the Gaussian Process accounts for long memory and for the inter-dependence in word selection. The way we formulate our model allows us to measure the contribution of each of the embeddings we have constructed, and hence to draw conclusions about the representational power each contains.
KW - natural language
KW - text processing
KW - long memory
KW - persistence
KW - multifractal time-series
KW - Brownian bridge
KW - Multiple-Output Gaussian Processes
KW - item-response models
KW - contingency tables
M3 - Preprint
BT - A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling
ER -