CMT Group Seminar | February 11, 10:00
Large-scale probabilistic modeling and machine learning
Modern data analysis requires computations on massive data. For example, consider the problem of automatically classifying articles in a huge digital archive containing millions of entries, or recommending items to millions of users based on their purchase history. Bayesian probabilistic modeling allows us to make assumptions about hidden structure in the data, that is, structure that cannot be observed directly. In Bayesian inference, we fit a probability distribution that reveals this structure. Bayesian inference has been inspired by theoretical physics over many years, a prominent example being Markov chain Monte Carlo algorithms. In a more recent approach called variational inference, Bayesian inference is mapped to an optimization problem. Here, we fit a parametrized 'mean field' distribution by optimizing over variational parameters so as to maximize the statistical evidence of the data. Combined with stochastic optimization, this method scales to massive data sets; the resulting approach is termed stochastic variational inference (SVI). SVI uses cheap, noisy gradients computed by subsampling the large underlying data set. It suffers from two major problems: noisy gradients and a non-convex objective. We introduce a scheme that reduces the noise by averaging over parts of the past gradients. This introduces a bias, and we discuss the tradeoff between variance and bias. To enable convergence to better local optima, we introduce deterministic annealing for SVI: a temperature parameter deterministically deforms the objective and is then reduced over the course of the optimization. We test both methods on Latent Dirichlet Allocation, a topic model, applied to three large text corpora.
Columbia University
Seminar room 0.01, ETP
Contact: Achim Rosch
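
Illustration (not part of the talk): the variance-bias tradeoff mentioned in the abstract can be seen in a toy setting. The sketch below is not the speaker's scheme; it assumes a one-dimensional objective f(x) = 0.5 * x**2, Gaussian noise standing in for subsampling noise, and a plain unweighted average over the last few gradients. The function name and all parameters are hypothetical.

```python
import random


def averaged_gradient_stats(window, steps=20, lr=0.1, x0=5.0,
                            noise=1.0, trials=2000, seed=0):
    """Estimate bias and variance of a windowed gradient average (toy model)."""
    rng = random.Random(seed)
    # Noise-free gradient descent path on f(x) = 0.5 * x**2.
    # The exact gradient at x is x itself.
    path = [x0 * (1.0 - lr) ** t for t in range(steps + 1)]
    recent = path[-window:]   # parameters at which the past gradients were taken
    true_grad = path[-1]      # exact gradient at the current parameter
    estimates = []
    for _ in range(trials):
        # Each past gradient is exact-plus-noise, evaluated at an older parameter.
        noisy = [x + rng.gauss(0.0, noise) for x in recent]
        estimates.append(sum(noisy) / window)
    mean = sum(estimates) / trials
    variance = sum((e - mean) ** 2 for e in estimates) / trials
    return mean - true_grad, variance


if __name__ == "__main__":
    for w in (1, 5, 10):
        bias, var = averaged_gradient_stats(w)
        print(f"window={w:2d}  bias={bias:+.3f}  variance={var:.3f}")
```

In this toy model, a longer window lowers the variance roughly in proportion to 1/window, while the bias grows because older gradients were evaluated at parameters that have since moved.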