Final Report GSOC 2018

Mon 13 August 2018

Report of work done so far

GSOC 2018 Summary

This summer I focused primarily on improvements to the PyMC3 Gaussian process module. These items include support for multiple observations, additional capabilities for larger data sets, a "Latent" Kronecker implementation, and additional supporting documentation and in-depth examples.

Currently, the Latent Kronecker implementation is ready to merge pending a final review. All tests are passing and the example runs smoothly. The pull request is here, with the demo here.

I added an in-depth tutorial on GP modeling by extending model of the textbook example "Mauna Loa" data set to include ice core data. This example describes some of the advanced modeling that can be done with PyMC3 and the GP library, such as custom mean and covariance functions, GPs with uncertain inputs and complex kernel design. I incorporated feedback from a scientist who works in the area, with whom I connected by one of the core PyMC3 developers. Some cosmetic tweaks are required before merge.

The multiple observations functionality proved more challenging. The original design of the GP testing suite compared GP implementations to the base, un-approximated GP implementation in PyMC3. Since all of the tests essentially depend on each other, adjusting the functionality of one implementation causes a large number, sometimes nearly all, of the tests to fail. In order to solve this, I went through several iterations refactoring the GP testing suite until I found an approach such that all tests both demonstrate correctness of the implementation, and are independent. The downside is that the tests run much slower.

Another focus was on GP implementations for larger data sets. Since doing further research, I've somewhat cooled on the idea of inducing-point based variational approximations. The pros of these implementations are that (through minibatching the data) they can scale to data with millions of points and they require no special structure in the space of independent variables. However, they require the likelihood to be input to the GP constructor, which clashes with the flexibility that's fundamental to PyMC3. If one needs to make such a specialized and "GP-only" model, they are likely better off using a GP only library that's devoted to such specializations. Also, inducing-point models perform best on data sets where the underlying GP has longer relative lengthscales. Data where shorter length variation are just as common, and here, inducing point methods often fall short.

I've prototyped a working implementation of a basic Latent inducing point model before GSOC started (Marginal is already finished). This was picked up and submitted as a PR by someone else, but may have been abandoned and is not yet merged. I intend to reach out and possibly complete it if the other person is not interested. After seeing a Stan implementation, I have done some research into Nearest Neighbor Gaussian processes (NNGP), which may be a better fit for PyMC3. It appears very effective for spatial modeling, far more so than inducing point based approximations. Potential problems are that it requires the use of Theano's "scan" function, and adds a very large number of small covariance matrices to the computational graph, which may cause memory issues, negating it's ability to fulfill its purpose of handling large data sets.

I also spent some time exploring PyMC4, which will have as similar a syntax as possible to PyMC3, but use a different backend for autodifferentiation. Mostly for my own learning, and as an exercise in thinking about design, I made a working proof-of-concept using HIPS/autograd (the dynamic graph numpy backend). My goal is for the design to be completely uncoupled from the backend, such that it could be used with either static and dynamic graph backends. In order to test this out, I intend to get this to also work with a static graph.

What remains

What I intent to finish in the next few weeks of unofficial "extra time"

Finishing the test refactor
Support for multiple observations
Prototyping NNGP
Cleaning up the Latent inducing point implementation and getting it merged