Departmental Colloquia: Lauren Hannah

LAUREN HANNAH

Department of Statistics
Columbia University

 

“Summarizing Topics: From Word Lists to Descriptive Phrases”

 

ABSTRACT

We introduce a statistically principled, computationally efficient, two-stage method for generating phrase-based topic summaries from the inferred parameters of any statistical topic model based on latent Dirichlet allocation. The method involves (1) identifying n-gram phrases and (2) selecting descriptive words and phrases for each topic using a novel metric, KALE, that balances distinctiveness and recognizability. We describe three phrase-finding algorithms, including a new Bayesian algorithm that does not rely on topic model parameters and therefore constitutes a general-purpose, stand-alone contribution to the phrase-finding literature. We provide a human-subjects evaluation of our two-stage topic summarization method, comparing summaries produced using each phrase-finding algorithm to summaries consisting of the most probable words for each topic. We also compare summaries produced by the best-performing variant of our method to those produced by other commonly used topic summarization methods.