Comparison with george

One of the tinygp design decisions was to provide a high-level API similar to the one provided by the george GP library. This was partly because I (as the lead developer of george) wanted to ease users’ transitions away from george to something more modern (like tinygp). I also quite like the george API and don’t think that other similar tools exist. The defining feature is that tinygp does not include built-in implementations of inference algorithms. Instead, it provides an expressive model-building interface that makes it easy to experiment with different kernels while still integrating with your favorite inference engine, as sketched below.
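
As a rough illustration, here is a minimal sketch (an addition to this page, not taken from the george or tinygp documentation) of what that workflow can look like: the tinygp log likelihood is just a jax-transformable function, so it can be differentiated with jax and handed to an external optimizer or sampler. The toy data, the parameter vector theta, and the use of scipy.optimize.minimize are illustrative choices; the tinygp calls mirror the ones used later in this notebook.

import numpy as np
import jax
import jax.numpy as jnp
from jax import config
from scipy.optimize import minimize

import tinygp

config.update("jax_enable_x64", True)

# A tiny toy dataset just for this sketch
demo_random = np.random.default_rng(42)
x_demo = np.sort(demo_random.uniform(0, 10, 50))
y_demo = np.sin(x_demo) + 0.1 * demo_random.normal(size=len(x_demo))


def negative_loglike(theta):
    # theta = [log(sigma), log(rho)]
    sigma, rho = jnp.exp(theta)
    kernel = sigma ** 2 * tinygp.kernels.Matern32(rho)
    gp = tinygp.GaussianProcess(kernel, x_demo, diag=0.01)
    return -gp.condition(y_demo)


# jax provides the gradient; scipy (or any other engine) does the inference
value_and_grad = jax.jit(jax.value_and_grad(negative_loglike))


def objective(theta):
    value, grad = value_and_grad(jnp.asarray(theta))
    return float(value), np.asarray(grad)


soln = minimize(objective, np.log([1.0, 1.0]), jac=True, method="L-BFGS-B")
print(soln.x)  # maximum likelihood values of log(sigma) and log(rho)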

In this document, we compare the interfaces and computational performance of george and tinygp for constructing kernel models and evaluating the GP marginalized likelihood. Since tinygp supports GPU acceleration, we executed this notebook on a machine with the following GPU:

!nvidia-smi
Fri Jul 23 16:50:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P0    67W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

By default, the CPU versions of both george and tinygp will use parallelized linear algebra libraries to take advantage of multiple CPU threads. However, to make the benchmarks more reproducible, we’ll disable this parallelization for the remainder of this notebook:

import os

# Restrict OpenMP-based linear algebra (used on the CPU by numpy/george) and
# jax's XLA CPU backend to a single thread
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"
)

Then we generate some simulated data and define functions for computing the GP log likelihood using george and tinygp (with separate CPU and GPU versions). As mentioned above, the syntax of these functions is quite similar, but there are a few differences. Most notably, the units of the “metric” or “length scale” parameter in the kernel differ: george expects the squared length scale, while tinygp expects the length scale itself. Also, the gp.compute method no longer exists in tinygp, since that kind of stateful setup step fits poorly with jax’s preference for pure functional programming.

try:
    import tinygp
except ImportError:
    !pip install -q tinygp

try:
    import george
except ImportError:
    !pip install -q george
from functools import partial

import numpy as np
import matplotlib.pyplot as plt

import jax
from jax import config
import jax.numpy as jnp

import george
import tinygp

config.update("jax_enable_x64", True)

sigma = 1.5
rho = 2.5
jitter = 0.1

random = np.random.default_rng(49382)
x = np.sort(random.uniform(0, 10, 10_000))
y = np.sin(x) + jitter * random.normal(0, 1, len(x))


def george_loglike(x, y, **kwargs):
    # In george, the Matern32Kernel "metric" is the squared length scale
    kernel = sigma ** 2 * george.kernels.Matern32Kernel(rho ** 2)
    gp = george.GP(kernel, **kwargs)
    gp.compute(x, jitter)
    return gp.log_likelihood(y)


def tinygp_loglike(x, y):
    # In tinygp, the Matern32 kernel takes the length scale directly
    kernel = sigma ** 2 * tinygp.kernels.Matern32(rho)
    gp = tinygp.GaussianProcess(kernel, x, diag=jitter ** 2)
    return gp.condition(y)


hodlr_loglike = partial(
    george_loglike, solver=george.solvers.HODLRSolver, tol=0.5
)
tinygp_loglike_cpu = jax.jit(tinygp_loglike, backend="cpu")
tinygp_loglike_gpu = jax.jit(tinygp_loglike, backend="gpu")
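
As a quick sanity check (this comparison is an addition on top of the original benchmark), the unit conventions described above mean that both functions define exactly the same model, so their log likelihoods should agree to within floating point precision:

# Both implementations should return (numerically) the same value
print(george_loglike(x[:100], y[:100]))
print(tinygp_loglike_cpu(x[:100], y[:100]))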

Now we benchmark the computational cost of computing the log likelihood using each of these methods:

ns = [10, 100, 1_000, 10_000]
george_time = []
hodlr_time = []
cpu_time = []
gpu_time = []
for n in ns:
    print(f"N = {n}:")

    # Run the JAX versions once each to make sure they get JIT'd
    args = x[:n], y[:n]
    tinygp_loglike_cpu(*args).block_until_ready()
    tinygp_loglike_gpu(*args).block_until_ready()
    
    results = %timeit -o george_loglike(*args)
    george_time.append(results.average)

    results = %timeit -o hodlr_loglike(*args)
    hodlr_time.append(results.average)
    
    results = %timeit -o tinygp_loglike_cpu(*args).block_until_ready()
    cpu_time.append(results.average)
    
    results = %timeit -o tinygp_loglike_gpu(*args).block_until_ready()
    gpu_time.append(results.average)
N = 10:
385 µs ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
283 µs ± 488 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
14.2 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
113 µs ± 4.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
891 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
629 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
398 µs ± 428 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
191 µs ± 672 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
31.3 ms ± 49.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.95 ms ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.5 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.54 ms ± 544 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 10000:
8.89 s ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
7.2 s ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
84.1 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In the plot of these benchmark results (generated by the code at the end of this section), you’ll notice several features:

  1. For very small datasets, the tinygp CPU implementation is significantly faster than any of the other implementations. This is because jax.jit removes a lot of the Python overhead that is encountered when chaining numpy functions.

  2. For medium to large datasets, tinygp is generally faster than george, with the GPU version seeing a significant advantage.

  3. The CPU implementations approach the expected asymptotic complexity of \(\mathcal{O}(N^3)\) only for the largest values of \(N\). This is probably caused by memory allocation overhead or other operations with better scaling than the Cholesky factorization (see the quick check after this list).

  4. The approximate “HODLR” solver from george outperforms the GPU-enabled tinygp exact solver, but only for the largest datasets, and it’s important to note that the HODLR method does not scale well to higher input dimensions. Any approximate solver like this that is implemented in jax could easily be used in conjunction with tinygp, but no such solvers have been implemented yet.
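
As a quick check on point 3 (this calculation is an addition to the original notebook), the empirical scaling exponent between the two largest problem sizes can be read off directly from the timings collected above; for an exact solver it should approach 3 as \(N\) grows:

# Estimate d(log runtime) / d(log N) from the two largest benchmark points
for name, times in [("george (basic)", george_time), ("tinygp (CPU)", cpu_time)]:
    slope = np.log(times[-1] / times[-2]) / np.log(ns[-1] / ns[-2])
    print(f"{name}: empirical exponent ~ {slope:.2f}")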

plt.loglog(ns, george_time, "o-", label="george (basic)")
plt.loglog(ns, hodlr_time, "o-", label="george (HODLR)")
plt.loglog(ns, cpu_time, "o-", label="tinygp (CPU)")
plt.loglog(ns, gpu_time, "o-", label="tinygp (GPU)")
ylim = plt.ylim()
plt.loglog(
    ns,
    0.5 * np.array(ns) ** 3 / ns[-1] ** 3 * cpu_time[-1],
    ":k",
    label="O($N^3$)",
)
plt.ylim(ylim)
plt.legend()
plt.xlabel("number of data points")
plt.ylabel("runtime [s]");
[Figure: log-log plot of runtime versus number of data points for george (basic), george (HODLR), tinygp (CPU), and tinygp (GPU), with the O(N^3) reference line.]