# Background¶

## Bayesian Inference¶

In the context of Bayesian inference, we are often interested
in estimating the **posterior** \(P(\boldsymbol{\Theta} | \mathbf{D}, M)\)
of a set of **parameters** \(\boldsymbol{\Theta}\) for a given **model**
\(M\) given some **data** \(\mathbf{D}\). This can be factored into a
form commonly known as **Bayes’ Rule** to give

where

is the **likelihood**,

is the **prior**, and

is the **evidence**, with the integral taken over the entire domain
\(\Omega_{\boldsymbol{\Theta}}\) of \(\boldsymbol{\Theta}\)
(i.e. over all possible \(\boldsymbol{\Theta}\)).

For complicated data and models, the posterior is often intractable and must be estimated using numerical methods (see, e.g., here).

## Nested Sampling¶

### Overview¶

Nested sampling
is a method for estimating the Bayesian evidence \(\mathcal{Z}\) first
proposed and developed by John Skilling. The basic idea
is to approximate the evidence by integrating
the prior in nested “shells” of constant likelihood.
Unlike Markov Chain Monte Carlo (MCMC) methods
which can only generate samples *proportional to* the posterior, Nested
Sampling simultaneously estimates both the evidence and the posterior.
It also has a variety of appealing statistical properties, which include:

well-defined stopping criteria for terminating sampling,

generating a sequence of independent samples,

flexibility to sample from complex, multi-modal distributions,

the ability to derive how statistical and sampling uncertainties impact results

*from a single run*, andbeing trivially parallelizable.

Dynamic Nested Sampling goes even further by allowing samples to be allocated adaptively during the course of a run to better sample areas of parameter space to maximize a chosen objective function. This allows a particular Nested Sampling algorithm to adapt to the shape of the posterior in real time, improving both accuracy and efficiency.

These points will be discussed elsewhere in the documentation when relevant.

### How It Works¶

Nested Sampling attempts to estimate \(\mathcal{Z}\) by treating the
integral of the posterior over all \(\boldsymbol{\Theta}\) as instead an
integral over the **prior volume**

contained within an **iso-likelihood contour** set by
\(\mathcal{L}(\boldsymbol{\Theta}) = \lambda\) via:

assuming \(\mathcal{L}(X(\lambda)) = \lambda\) exists. In other words, if we can evaluate the iso-likelihood contour \(\mathcal{L}_i \equiv \mathcal{L}(X_i)\) associated with a bunch of samples from the prior volume \(1 > X_0 > X_1 > \dots > X_N > 0\), we can compute the evidence using standard numerical integration techniques (e.g., the trapezoid rule). Computing the evidence using these “nested shells” is what gives Nested Sampling its name.

### Basic Algorithm¶

Draw \(K\) **“live” points** (i.e. particles) from the prior
\(\pi(\boldsymbol{\Theta})\). At each iteration \(i\), remove the live
point with the lowest likelihood \(\mathcal{L}_i\) and replace it with a
new live point *sampled from the prior* subject to the constraint
\(\mathcal{L}_{i+1} \geq \mathcal{L}_i\). It can be shown through some neat
statistical arguments (see Nested Sampling Errors) that this sampling
procedure actually allows us to estimate the prior volume of the *previous*
live point (a **“dead” point**) as:

Once some stopping criteria are reached and sampling terminates, the remaining set of live points are distributed uniformly within the final prior volume. These can then be “recycled” and added to the list of samples.

### Evidence Estimation¶

The evidence integral can be numerically approximated using a set of \(N\) dead points via

where \(\hat{w}_i\) is each point’s estimated weight.
For a simple linear integration scheme using rectangles, we can take
\(f(\mathcal{L}_i) = \mathcal{L}_i\) and
\(f(\Delta X_i) = X_{i-1} - X_i\).
For a quadratic integration scheme using trapezoids (as used in `dynesty`

),
we instead can take
\(f(\mathcal{L}_i) = (\mathcal{L}_{i-1} + \mathcal{L}_i) / 2\).

### Posterior Estimation¶

We can subsequently estimate posteriors “for free” from the same \(N\)
dead points by assigning each sample its associated **importance weight**

### Stopping Criteria¶

The remaining evidence \(\Delta \hat{\mathcal{Z}}_i\) at iteration \(i\) can roughly be bounded by

where \(\mathcal{L}_{\max}\) is the the maximum likelihood point contained within the remaining set of \(K\) live points. This essentially assumes that the remaining prior volume interior to the last dead point is a uniform slab with likelihood \(\mathcal{L}_{\max}\).

This can be turned into a relative stopping criterion by using the (log-)ratio between the current estimated evidence \(\hat{\mathcal{Z}}_i\) and the remaining evidence \(\Delta \hat{\mathcal{Z}}_i\):

Stopping at a given \(\Delta \ln \hat{\mathcal{Z}}_i\) value (`dlogz`

)
then means sampling until only a *fraction* of the evidence remains unaccounted
for.

In general, this error estimate serves as a (rough) upper bound (since \(X_i\) is not exactly known) that can be used for deciding when to stop sampling from an arbitrary distribution while estimating the evidence. Other stopping criteria are discussed in Dynamic Nested Sampling.

### Challenges¶

Nested Sampling has two main main theoretical requirements:

Samples must be evaluated

*sequentially*subject to the likelihood constraint \(\mathcal{L}_{i+1} \geq \mathcal{L}_{i}\), andAll samples used to compute/replace live points must be

**independent and identically distributed (i.i.d.)**random variables*drawn from the prior*.

The first requirement is entirely algorithmic and straightforward to satisfy
(even when sampling in parallel). The second requirement, however, is much more
challenging if we hope to sample efficiently: while it is straightforward to
generate samples from the prior, by design Nested Sampling makes this simple
scheme increasingly more inefficient since the remaining prior volume
shrinks *exponentially* over time.

Solutions to this problem often involve some combination of:

Proposing new live points by “evolving” a copy of one (or more) current live points to new (independent) positions subject to the likelihood constraint, and

Bounding the iso-likelihood contours using simple but flexible functions in order to exclude regions with lower likelihoods.

In both cases, it is much easier to deal with uniform (rather than arbitrary)
priors. As a result, most nested sampling algorithms/packages (including
`dynesty`

) are designed to sample within the \(D\)-dimensional unit cube.
Samples are transformed samples back to the original parameter space
“on the fly” only when needed to evaluate the likelihood.
Accomplishing this requires an appropriate **prior transform**, described
in more detail under Prior Transforms.

## Typical Sets¶

One of the elegant features of Nested Sampling is it directly incorporates the
ideas behind a typical set
into the estimation. Since this concept is **crucial** in most
Bayesian inference applications but rarely discussed explicitly in
applied methods such as MCMC, it is important to take some time to
discuss it in more detail.

### Quick Overview¶

In general, the contribution to the posterior at a given value (position)
\(\boldsymbol{\Theta}\) has two components. The first arises from the
particular *value* of the posterior itself, \(P(\boldsymbol{\Theta})\).
The second arises from the total (differential) *volume*
\(dV(\boldsymbol{\Theta})\) encompassed by all
\(\boldsymbol{\Theta}\)’s with the particular
\(P(\boldsymbol{\Theta})\). We can understand this intuitively: the
contributions from a small region with large posterior values can be
overwhelmed by contributions from much larger regions with small posterior
values.

This “tug of war” between the two elements means that the regions which contribute the most to the overall posterior are those that maximize the joint quantity

This region typically forms a “shell” surrounding the mode (i.e. the
maximum a posteriori (MAP) value)
and is what is usually referred to as the **typical set**. This behavior
becomes more accentuated as the dimensionality increases: since volume scales
as \(r^D\), increasing the dimensionality of the problem creates
exponentially more volume further away from the posterior mode.

### Typical Sets in Nested Sampling¶

Under the framework of Nested Sampling, this concept naturally emerges from the concept of integrating the evidence in shells of “prior volume”:

We can see directly that the contribution of a particular iso-likelihood
contour \(\mathcal{L}(X)\) to the integral depends both on its
“amplitude” \(\mathcal{L}(X)\) along with the (differential) prior volume
\(dX\) it occupies. This is maximized when both these quantities are
jointly maximized, which occurs over points that represent the typical set.
Because of the contribution from the “density” and “volume” terms
are clearly seen here, this is sometimes also referred to as
the **posterior mass**.
Since the posterior importance weights

are also directly proportional to these quantities, Nested Sampling also naturally weights samples by their contribution to the typical set.

## Priors in Nested Sampling¶

Unlike MCMC or similar methods, Nested Sampling starts by randomly sampling from the entire parameter space specified by the prior. This is not possible unless the priors are “proper” (i.e. that they integrate to 1). So while Normal priors spanning (\(-\infty\), \(+\infty\)) are fine, Uniform priors spanning the same range are not and must be bounded.

**It cannot be
stressed enough that the evidence is entirely dependent on the “size” of the
prior.** For instance, a wider Uniform prior will decrease the contribution of
high-likelihood regions to the evidence estimate, leading to a lower overall
value. Priors should thus be carefully chosen to ensure models can be properly
compared using the evidences computed from Nested Sampling.

In addition to affecting the evidence estimate, the prior also directly affects
the overall expected runtime. Since, in general, the posterior
\(P(\boldsymbol{\Theta})\) is (much) more localized that the prior
\(\pi(\boldsymbol{\Theta})\), the “information” we gain from updating
from the prior to the posterior can be characterized by the
**Kullback-Leibler (KL) divergence** (see
here
for more information):

It can be shown/argued that the total number of steps \(N\) needed to integrate over the posterior roughly scales as:

In other words, increasing the size of the prior *directly* impacts the amount
of time needed to integrate over the posterior. This highlights one of the
main drawbacks of nested sampling: **using less “informative” priors will
increase the expected number of nested sampling iterations**.