Topic Modeling (TM) is a rapidly growing area at the interface of text mining, artificial intelligence and statistical modeling, increasingly deployed to address the ‘information overload’ associated with extensive text repositories. The goal in TM is typically to infer a rich yet intuitive summary model of a large document collection, comprising a set of topics that characterizes the collection – each topic being a probability distribution over words – along with the degrees to which each individual document is concerned with each topic. The model then supports segmentation, clustering, profiling, browsing, and many other tasks. Current approaches to TM, dominated by Latent Dirichlet Allocation (LDA), assume a topic-driven document generation process and find a model that maximizes the likelihood of the data with respect to this process. This is clearly sensitive to any mismatch between the ‘true’ generating process and the statistical model, while it is also clear that the quality of a topic model is multi-faceted and complex: individual topics should be intuitively meaningful, sensibly distinct, and free of noise. Here we investigate multi-objective approaches to TM, which attempt to infer coherent topic models by navigating the trade-offs between objectives oriented towards coherence as well as coverage of the corpus at hand. Comparisons with LDA show that adoption of multi-objective evolutionary algorithm (MOEA) approaches yields significantly more coherent topics than LDA, consequently enhancing the use and interpretability of these models in a range of applications, without significant degradation in generalization ability.
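The abstract does not specify the objective functions used, but the coherence/coverage trade-off it describes can be illustrated with a minimal sketch. Here, hypothetically, coherence is a UMass-style pairwise co-occurrence score, coverage is the fraction of corpus tokens captured by the topics' word lists, and Pareto dominance compares candidate models on both objectives; the toy corpus, function names, and scoring choices are assumptions for illustration only, not the paper's actual formulation.

```python
from itertools import combinations
from math import log

# Toy corpus: each document is a set of words (illustrative only).
docs = [
    {"topic", "model", "word", "distribution"},
    {"topic", "word", "cluster", "document"},
    {"genetic", "algorithm", "pareto", "front"},
    {"pareto", "front", "objective", "trade"},
]

def doc_freq(w):
    # Number of documents containing word w.
    return sum(1 for d in docs if w in d)

def co_freq(w1, w2):
    # Number of documents containing both words.
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words):
    """UMass-style coherence: sum of smoothed log co-occurrence
    over word pairs in the topic (higher = more coherent)."""
    return sum(
        log((co_freq(w1, w2) + 1) / doc_freq(w2))
        for w1, w2 in combinations(topic_words, 2)
    )

def coverage(topics):
    """Fraction of corpus tokens appearing in at least one topic's word list."""
    vocab = set().union(*topics)
    total = sum(len(d) for d in docs)
    return sum(len(d & vocab) for d in docs) / total

def dominates(a, b):
    """Pareto dominance for maximization of (coherence, coverage) vectors:
    a dominates b if it is no worse on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
```

An MOEA would evolve a population of candidate topic sets, using `dominates` to rank them and retain the non-dominated front, rather than collapsing the two objectives into a single likelihood as LDA does.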
|Published - 19 Mar 2013
|7th International Conference on Evolutionary Multi-Criterion Optimization - Sheffield, United Kingdom
Duration: 19 Mar 2013 → 23 Mar 2013