{
"cells": [
{
"cell_type": "markdown",
"id": "ac42b770-d804-4aec-be8f-f01e4d64f47f",
"metadata": {},
"source": [
"# `hetGPy` Intro: Background and Basic Usage\n",
"\n",
"* This notebook gives a high-level overview of the basics of `hetGPy`\n",
"* We discuss the core model objects, regression, and prediction\n",
"\n",
"This document assumes some prior knowledge of Gaussian Process modeling. For those looking for a refresher, some suggested sources are [Ch.2 of Rassmussen and Williams (2006)](https://gaussianprocess.org/gpml/), [Ch. 5 of Gramacy (2020)](https://bookdown.org/rbg/surrogates/#front-cover), and [Ch. 2-3 of Garnett (2023)](https://bayesoptbook.com/).\n",
"\n",
"\n",
"### What is `hetGPy`?\n",
"\n",
"`hetGPy` is a Python library for homoskedastic and heteroskedastic Gaussian Process modeling. It is a Python port of the `hetGP` R package from [Binois and Gramacy (2018)](https://cran.r-project.org/web/packages/hetGP/index.html). The package also includes functions for sequential design and Bayesian Optimization.\n",
"\n",
"\n",
"### Why is a Python port needed when we have `rpy2` to call R code from Python?\n",
"\n",
"For the basic use cases of Gaussian Process modeling, most of the time you can wrap the R code from `hetGP` via the [`rpy2`](https://rpy2.github.io/) Python library. However, there are three principle motivations for having a Python port:\n",
"\n",
"1. **Environment restrictions and sequential design:**\n",
" \n",
" Gaussian Process modeling is especially useful for computer experiments and Bayesian Optimization. In both domains, it is common to work with expensive simulation functions which are run on high-performance computing clusters, and many of these simulators are written in Python. Thus, when working on computing clusters, it is advantageous to use one type of software rather than trying to build multiple computing environments. Additionally, in `rpy2` the R objects live in R's memory space, and their size is [unknown to Python](https://rpy2.github.io/doc/latest/html/performances.html). For large-scale analyses, this could prove problematic.\n",
"\n",
" In many cases, computer experiments and Bayesian Optimization rely on sequential design: initialzing some set of inputs, acquiring (usually expensive) data, and then choosing the next set of experiments via some criteria or acquisition function. Thus, keeping the entire analysis in one language (in this case, Python) is desirable for human-out-of-the-loop workflows.\n",
"\n",
"\n",
"2. **Opportunity for extensions:** \n",
" \n",
" The core maximum likelihood routines in `hetGP` and `hetGPy` rely on hand-coded likelihoods and gradients, and are thus restricted to a subset of closed-form covariance kernels (in this case, Gaussian and Matern kernels ($\\nu = 3/2, 5/2$)). State of the art Python libraries, such as [`pytorch`](https://pytorch.org/) and [`jax`](https://jax.readthedocs.io/en/latest/index.html) utilize autodifferentiation methods, which would allow for a more flexible use of arbitrary covariance kernels in `hetGPy`.\n",
"\n",
"3. **Availability of high-performance optimization functions:**\n",
" \n",
" Most of the computation in `hetGPy` is the optimization of the maximum likelihood which relies on the [`L-BFGS-B`](https://en.wikipedia.org/wiki/Limited-memory_BFGS#Variants#L-BFBS-B) algorithm, a standard constrained optimization routine. The definitive [`L-BFGS-B`](https://en.wikipedia.org/wiki/Limited-memory_BFGS#Variants#L-BFBS-B) implementation is by [Nocedal and Morales (2011)](https://dl.acm.org/doi/10.1145/2049662.2049669) in Fortran, which is callable from [`scipy`](https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html) but [not from base R](https://nlmixrdevelopment.github.io/lbfgsb3c/articles/lbfgsb3c.html). \n",
"\n",
"\n",
"### Basic Setup\n",
"\n",
"* The central objects that performs Gaussian process regeression (GPR) in `hetGPy` are the `homGP` and `hetGP` classes for homoskedastic and heteroskedasic regression.\n",
"\n",
"First, we show `homGP` modeling on noiseless data, noisy data, and finally on heteroskedastic noise.\n",
"\n",
"### Noiseless case"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dc4b27c5-9c78-42b7-b152-5ca232262e41",
"metadata": {},
"outputs": [],
"source": [
"from hetgpy import homGP\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"xgrid = np.linspace(0,2*np.pi, 100).reshape(-1,1)\n",
"xs = xgrid.squeeze() # for plotting\n",
"\n",
"X = np.linspace(0,2*np.pi, 10).reshape(-1,1)\n",
"Z = np.sin(X).squeeze()"
]
},
{
"cell_type": "markdown",
"id": "f05c20a6-ad0b-42b4-b340-b979ebfee29e",
"metadata": {},
"source": [
"* To train a GPR, instantiate the model object (`homGP` or `hetGP`) and call the maximum likelihood estimation function (`mle` which aliases the functions `mleHomGP` and `mleHetGP` which are the equivalent function names in the `hetGP` R package).\n",
"* After training, call `predict` on the model object to return the mean and variance predictions. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6e021f2d-737f-40a7-ad34-29aca7312d0a",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# now train a homoskedastic Gaussian Process Regression\n",
"%config InlineBackend.figure_formats = ['svg']\n",
"fig, ax = plt.subplots()\n",
"ax.scatter(X, Z, color = 'k', label = 'Data')\n",
"ax.legend(edgecolor='black')\n",
"ax.set_xlabel('X'); ax.set_ylabel('sin(X)')\n",
"\n",
"GP = homGP()\n",
"GP.mle(\n",
" X = X,\n",
" Z = Z,\n",
" covtype = \"Gaussian\",\n",
" lower = np.array([0.1]),\n",
" upper = np.array([2.0])\n",
")\n",
"preds = GP.predict(x = xgrid)\n",
"ax.plot(xs, preds['mean'], color = 'blue', label = 'Mean')\n",
"ax.legend();"
]
},
{
"cell_type": "markdown",
"id": "32b9312a-e4d6-42e9-89df-af1e339c6f5a",
"metadata": {},
"source": [
"### Fast Estimation under Replication\n",
"\n",
"* `hetGPy` is designed to handle replication\n",
"* Consider the same example, except this time with some replication (5 each) and noise\n",
"* Like its R counterpart, `hetGPy` does training on the order of the unique input locations $n$ rather than the full dataset $N$. When replication is high, as is common in stochastic computer experiments, this can yield much shorter training times."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9c0b58f8-be2c-49be-abb3-bff94246c5f5",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rand = np.random.default_rng(seed = 42)\n",
"\n",
"X = np.linspace(0,2*np.pi, 10)\n",
"X = np.repeat(X,repeats = 5).reshape(-1,1)\n",
"Z = np.sin(X).squeeze() \n",
"Z += 0.2 * rand.normal(loc = 0,size = len(Z))\n",
"\n",
"fig, ax = plt.subplots()\n",
"\n",
"ax.plot(xs, np.sin(xs), label = 'True', color = 'red', alpha = 0.4)\n",
"ax.scatter(X, Z, color = 'k', label = 'Data')\n",
"ax.legend(edgecolor='black')\n",
"ax.set_xlabel('X'); ax.set_ylabel('sin(X)');\n",
"\n",
"\n",
"GP = homGP()\n",
"GP.mle(\n",
" X = X,\n",
" Z = Z,\n",
" covtype = \"Gaussian\",\n",
" lower = np.array([0.1]),\n",
" upper = np.array([2.0])\n",
")\n",
"preds = GP.predict(x = xgrid, interval = 'predictive', interval_lower = 0.05, interval_upper = 0.95)\n",
"interval = preds['predictive_interval']\n",
"\n",
"ax.plot(xs, preds['mean'], color = 'blue', label = 'Mean (homGP)')\n",
"ax.plot(xs, interval['lower'], color = 'blue', label = '90% Predictive Interval (homGP)', linestyle = 'dashed')\n",
"ax.plot(xs, interval['upper'], color = 'blue',linestyle = 'dashed')\n",
"ax.legend(edgecolor='black');"
]
},
{
"cell_type": "markdown",
"id": "78e4535d",
"metadata": {},
"source": [
"### Heteroskedastic Gaussian Process Regression"
]
},
{
"cell_type": "markdown",
"id": "77c5a23e",
"metadata": {},
"source": [
"We can also perform heteroskedastic GPR. Consider the [`mcycle`](https://rdrr.io/cran/MASS/man/mcycle.html) motorocycle simulation data from R (available in `hetGPy`) which is known to have heteroskedasticity.\n",
"\n",
"The major advantage of a heteroskedastic approach is the improved uncertainty quantification, in that we estimate a wider predictive interval in the higher noise regions, and a narrow one in lower noise regions, especially compared to a homoskedastic model."
]
},
{
"cell_type": "markdown",
"id": "5b89f1a8",
"metadata": {},
"source": [
"We model the noise as a series of latent variables at each unique input via a second (homoskedastic) GP. The model for the noise process is from [Binois and Gramacy (2018)](https://www.tandfonline.com/doi/full/10.1080/10618600.2018.1458625) and is also given in [Gramacy (2020), Ch. 10](https://bookdown.org/rbg/surrogates/chap10.html#chap10varp) (equation 10.7):\n",
"\n",
"\\begin{align*}\n",
"\\Lambda_n = C_{(\\delta)} K_{(\\delta)}^{-1} \\Delta_n, \\text{ where } K_{(\\delta)} = (C_{(\\delta)} + g_{(\\delta)} A_n^{-1}).\n",
"\\end{align*}\n",
"\n",
"Where $\\Lambda_n$ is the prediction of the noise, $C_{(\\delta)}$ is the covariance matrix, $\\Delta_n$ is the estimated variance at each input, and $A_n^{-1}$ is the inverse of the number of replicates for each input location (stored in a diagonal matrix)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "887bca58-9c88-4f86-9b5b-e3937e7eda4a",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from hetgpy import hetGP\n",
"from hetgpy.example_data import mcycle\n",
"\n",
"m = mcycle()\n",
"X = m['times']\n",
"Z = m['accel']\n",
"\n",
"xgrid = np.linspace(0,60,301).reshape(-1,1) # predictive grid\n",
"xs = xgrid.squeeze()\n",
"\n",
"hom, het = homGP(), hetGP()\n",
"\n",
"# homGP\n",
"hom.mle(\n",
" X = X,\n",
" Z = Z,\n",
" covtype = \"Matern5_2\",\n",
" lower = np.array([0.1]),\n",
" upper = np.array([50.0])\n",
")\n",
"preds_hom = hom.predict(x = xgrid, interval = 'predictive', interval_lower = 0.05, interval_upper = 0.95)\n",
"interval_hom = preds_hom['predictive_interval']\n",
"\n",
"# hetGP\n",
"het.mle(\n",
" X = X,\n",
" Z = Z,\n",
" covtype = \"Matern5_2\",\n",
" lower = np.array([0.1]),\n",
" upper = np.array([50.0])\n",
")\n",
"preds_het = het.predict(x = xgrid, interval = 'predictive', interval_lower = 0.05, interval_upper = 0.95)\n",
"interval_het = preds_het['predictive_interval']\n",
"\n",
"\n",
"fig, ax = plt.subplots(nrows=2,figsize=(9,6),sharey=True)\n",
"\n",
"ax[0].scatter(X.squeeze(),Z,label='Data',color='black')\n",
"ax[1].scatter(X.squeeze(),Z,color='black')\n",
"\n",
"ax[0].plot(xs, preds_hom['mean'], color = 'blue', label = 'homGP')\n",
"ax[0].plot(xs, interval_hom['lower'], color = 'blue', linestyle = 'dashed')\n",
"ax[0].plot(xs, interval_hom['upper'], color = 'blue',linestyle = 'dashed')\n",
"\n",
"ax[1].plot(xs, preds_het['mean'], color = 'red', label = 'hetGP')\n",
"ax[1].plot(xs, interval_het['lower'], color = 'red', linestyle = 'dashed')\n",
"ax[1].plot(xs, interval_het['upper'], color = 'red',linestyle = 'dashed')\n",
"\n",
"fig.legend(edgecolor='black',loc='upper center',ncol=3,bbox_to_anchor=(0.5,1.05))\n",
"[ax[i].set_xlabel('times (milliseconds)') for i in range(2)];\n",
"[ax[i].set_ylabel('accel (g)') for i in range(2)];\n",
"fig.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "7d2bee36",
"metadata": {},
"source": [
"### Specifying initializations for the hyperparamters\n",
"* Specify starting values for the hyperparameters via the `init = {param:value}` argument\n",
"\n",
"For example, to initialize the lengthscale $\\theta$ at $\\theta_0=10$:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "786ede78",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lengthscale: [9.95168406]\n",
"Log likelihood: -499.2919739824025\n"
]
}
],
"source": [
"GP = hetGP()\n",
"GP.mle(\n",
" X = X,\n",
" Z = Z,\n",
" lower = np.array([1.0]),\n",
" upper = np.array([10]),\n",
" init = {'theta': np.array([10.0])}\n",
")\n",
"print(f\"Lengthscale: {GP.theta}\")\n",
"print(f\"Log likelihood: {GP.ll}\")"
]
},
{
"cell_type": "markdown",
"id": "c901d20c",
"metadata": {},
"source": [
"### Specifying known values of hyperparameters\n",
"* You can also override the hyperparameter learning via the `known` argument\n",
"\n",
"For example, to fix the lengthscale of the mean process at $\\theta=3$ (and note the lower fit based on log-likelood):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5ef6cf26",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lengthscale: [3.]\n",
"Log likelihood: -513.7520035600228\n"
]
}
],
"source": [
"GP = hetGP()\n",
"GP.mle(\n",
" X = X,\n",
" Z = Z,\n",
" lower = np.array([1.0]),\n",
" upper = np.array([10]),\n",
" known = {'theta': np.array([3.0])}\n",
")\n",
"print(f\"Lengthscale: {GP.theta}\")\n",
"print(f\"Log likelihood: {GP.ll}\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}