FAQ
This page contains a collection of frequently asked questions
from dynesty
users, along with some answers that hopefully are helpful to
you. If you don’t see your particular issue addressed here, feel free to
open an issue.
For citation information, see the :ref:`Citations` section on the homepage.
Sampling Questions
What sampling method should I be using?
This is always problem-dependent, but some general advice is
to select one based on the dimensionality of your problem. In low dimensions,
uniform sampling is often quite efficient since the bounding distributions
can often encompass the majority of the prior volume. In moderate dimensions,
random walks often can serve as an effective way to propose new points
without relying on the exact shape/size of the bounds being correct. In
higher dimensions, we generally need non-rejection methods such as slice
sampling to generate samples efficiently since the prior volume is so large.
Using gradients can also help generate efficient proposals in this regime.
dynesty
uses these rules-of-thumb by default to choose a sampling option
with 'auto'
.
Sampling seems to freeze around an efficiency of 10%. Is this a bug?
This isn’t a bug, but probably just a consequence of the first bounding update.
By default, dynesty
waits to actually start sampling using the proposed
sampling/bounding methods passed to the sampler until a set of conditions
specified in first_update
are satisfied. This lets the live points somewhat
move away from the edges of the prior and begin to adapt to the shape of the
target distribution, which helps to avoid problems such as the bounds
“shredding” the live points into lots of tiny islands. The basic heuristic
used is to wait until uniform proposals from the prior hit a cumulative
efficiency of 10%, but that threshold can be adjusted using the
first_update
argument.
Is there an easy way to add more samples to an existing set of results?
Yes! There are actually a bunch of ways to do this.
If you have the static NestedSampler
run, the best strategy is to just
start a new independent run and then combine the old and new runs together
into a single (improved) run using the merge_runs()
function.
If you’re using the DynamicNestedSampler
, you can add as many batches as you want by running add_batch
. The add_batch have a auto
mode, or a manual
mode where you can add new batch corresponding to a certain likelihood
range (i.e. corresponding to where your posterior is concentrated for example).
Alternatively, if you are primarily interested in the posterior, you can use larger values of n_effective parameter of run_nested
as that will ensure your posterior is less noisy.
Finally, merge_runs()
also works with results generated
from Dynamic Nested Sampling, so it is just as easy to set off a new run and
combine it with your original result.
I have a very multi-modal posterior, or dynesty cannot find the mode of my posterior. What should I do
If you have a large number of posterior modes, or a very narrow posterior mode occupying a tiny volume of your posterior, it may be difficult to find it.
Typically using more live-points will help in those cases, as you have higher chance of discovering the your mode with one of the live-points.
Also if you have a really large number of modes, you can use add_batch
functionality with logl_bounds
of (-inf,inf), to basically do repeated sampling of the posterior. In this case you have a good chance of discovering all the modes in your data.
There are inf values in my lower/upper log-likelihood bounds! Should I be concerned?
In most cases no. As mentioned in Running Internally, these values
are just the lower and upper limits of the log-likelihood used to limit
your sampling. If you’re sampling starting from the prior,
you’re starting out from a likelihood of 0 and therefore a
log-likelihood of -inf
. If you haven’t specified a particular logl_max
to terminate sampling, the default value is set to be +inf
so it will
never prematurely terminate sampling. These values can change during
Dynamic Nested Sampling, at which point they serve as the endpoints between
which a new batch of live points is allocated.
In rare cases, errors in these bounds can be signs of Bad Things that may
have happened while sampling. This is often the case if the
log-likelihood values being sampled (and displayed) are also
are nonsensical (e.g., involve nan
or inf
values, etc.).
In that case, it is often useful to terminate the run early
and examine the set of samples to see if there are any possible issues.
Sometimes while sampling my estimated evidence errors become undefined! Should I be concerned?
Most often this is not a cause for concern. As mentioned in
Approximate Evidence Errors, dynesty
uses an approximate method to
estimate evidence errors in real time based on the KL divergence
(“information gain”) and the current number of live points.
Sometimes this approximation can lead to
improper results (i.e. negative variances), which can often occur
early in the run when there is a lot of uncertainty in the prior volume.
While this often “corrects” itself later in the run,
sometimes the effect can persist. Regardless of
whether the approximation converges, however, errors can still be computed
using the functions described in Nested Sampling Errors as normal.
I am currently working on developing a more robust approximation that
avoids some of these issues.
In rare cases, issues with the evidence error approximation can be a sign
that something has gone Terribly Wrong during the sampling phase. This
is often the case if the log-likelihood values being output also
are nonsensical (e.g., involve nan
or inf
values).
In that case, it is often useful to terminate the run early
and examine the set of samples to see if there are any possible issues.
When adding batches of live points sometimes the log-likelihoods being displayed don’t monotonically increase as I expect. What’s going on?
When points are added in each batch, they are allocated randomly between the lower and upper log-likelihood bounds (since they are being sampled randomly). These values are the ones being output to the terminal. Once all the points have been allocated, then nested sampling can begin by replacing each of the lowest log-likelihood values with a better one.
Sampling is taking much longer than I’d like. What should I do?!
Unfortunately, there’s no catch-all solution to this. The most important
first step is to make sure you’re examining real-time outputs using the
print_progress=True
option (enabled by default) if you’re sampling internally
using run_nested()
and printing out progress
if sampling externally using, e.g., sample()
.
If the bounding distribution is updating frequently and you’re using more
computationally intensive methods such as 'multi'
, some of this might be
due to excessive overhead associated with constructing the bounds. This can
be reduced by increasing update_interval
.
If the overall sampling efficiency is low (relative to what you’d expect), it
might indicate that the distribution used (e.g., 'single'
) isn’t effective
and more complex ones such as 'multi'
should be used instead. If you’re
already using those but still getting inefficient proposals, that might
indicate that the bounding distribution are struggling to capture the
target distribution. This can happen if, e.g., the posterior occupies a thin,
strongly-curved manifold in several dimensions, which is hard to model with
a series of overlapping ellipsoids or other similar distributions.
Another possible culprit might be the enlargement factors. While the default
25% value usually doesn’t significantly decrease the efficiency, there some
exceptions. If you are instead deriving expansion factors from bootstrapping,
it’s possible you’re experiencing severe Monte Carlo noise (see
Bounding Questions). You could try to resolve this by either using
more live points or switching to an alternate sampling method less sensitive
to the size of the bounding distributions such as 'rwalk'
or 'rslice'
.
If sampling progresses efficiently after the first bounding update (i.e. when
bound > 0
) for the majority of the run but becomes substantially less
efficient near the final dlogz
stopping criterion, that could be a sign that
the the current set of live points are unable to give rise to bounding
distributions that are detailed enough to track the shape of the remaining
prior volume. As above, this behavior could be remedied by using more live
points or alternate sampling methods. Depending on the goal, the dlogz
tolerance could also be adjusted.
Finally, if sampling seems to be progressing efficiently but is just
taking a long time, it might be because the high-likelihood regions of
parameter space are small compared to the prior volume. As discussed in
Priors in Nested Sampling, the time it takes to sample to a
given dlogz
tolerance scales as the “information” gained by updating from
the prior to the posterior. Since Nested Sampling starts by sampling from the
entire prior volume, having overly-broad priors will increase the runtime.
I noticed that the number of iterations and/or function calls during a run
don’t exactly match up with the limits I specify using,
e.g., maxiter
or maxcall
. Is this a bug?
No, this is not a bug (i.e. this behavior is not unintended).
When proposing a new point, dynesty
currently only
checks the stopping criterion specified (whether iterations or function calls)
after that point has been accepted. This can also happen when using the
DynamicSampler
to propose a new batch of points,
since the first batch of points need to be allocated before checking the
stopping criterion.
I find other sampling are inefficient relative to `’unif’`. Why would I ever want to use them?
The main reason these methods are more inefficient than uniform sampling is that they are designed to sample from higher-dimensional (and somewhat more “difficult”) distributions, which is inherently challenging due to the behavior of Typical Sets. Broadly speaking, these methods are actually reasonably efficient when compared to other (non-gradient) sampling methods on similar problems (see, e.g., here).
In addition, it is also important to keep in mind that samples from dynesty
are nominally independent (i.e. already “thinned”). As a reference point,
consider an MCMC algorithm with a sampling efficiency of 20%. While this
might seem more efficient than the 4% default target efficiency of 'rwalk'
in dynesty
, the output samples from MCMC are (by design) correlated.
If the resulting MCMC chain needs to be thinned by more than a factor of 5
to ensure independent samples, its “real” sampling efficiency is actually
then below the 4% nominally achieved by dynesty
. This is discussed
further in the release paper.
How many walks (steps) do you need to use for 'rwalk'
?
In general, random walk behavior leads to excursions from the mean at a rate
that scales as (roughly) \(\sqrt{n} \sigma\) where \(n\) is the number
of walks and \(\sigma\) is the typical length scale. The number of steps
needed then roughly scales as \(d^2\). In general this behavior doesn’t
dominate unless sampling in high (\(d \gtrsim 20\)) dimensions. In lower
dimensions (\(d \lesssim 15\)), walks=25
is often sufficient, while in
moderate dimensions (\(d \sim 15-25\)) walks=50
or greater are often
necessary to maintain independent samples.
What are the differences between 'slice'
and PolyChord?
Our implementation of multivariate slice sampling more closely follows the prescription in Neal (2003) than the algorithm outlined in the PolyChord paper. We conservatively enforce a strict Gibbs updating scheme that requires sampling from all 1-D conditional distributions (in random order); we term this entire update a “slice”. This enables us to rigorously satisfy detailed balance at the cost of being less efficient.
We also treat mode identification and sampling a little differently than
PolyChord. In dynesty
our bounding objects are used to track modes as well
as a set of orthogonal basis vectors characterizing that mode. Slicing then
takes place along that specific basis, allowing us to sample efficiently even in
a multi-modal context. For PolyChord, mode identification works using a
slightly different clustering algorithm and sampling takes place in a
“pre-whitened” space based on the derived orthogonal basis.
Our implementation of 'rslice'
more closely follows the method
employed in PolyChord.
How many slices (“repeats”) do you need to use for 'slice'
?
Since slice sampling is a form of non-rejection sampling,
the number of “slices” requires for Nested Sampling is
(in theory) independent of dimensionality and can remain relatively constant.
This is especially true if there are a set of local principle axes
that can be effectively captured by the bounding distributions
(e.g., 'multi'
). There are more pathological cases, however,
where the number of slices can weakly scale with dimensionality. In general
we find that the default (and conservative) slices=3
is robust under a wide variety of circumstances. Note that for the
'slice'
sampler slices=3 means that slice steps will be done 3 times over
each of the dimension of the problem (N). I.e. the total number of the moves
will be 3*N. Also note that for the 'rslice'
sampler the default
is slices=3+N
steps as 'rslice'
does not loop over each of the dimension,
as it chooses the move directions randomly.
The stopping criterion for Dynamic Nested Sampling is taking a long time to evaluate. Is that normal?
This might mean you are using a version of dynesty
below v1.2 or
you are using a large number of simulations to estimate the errors.
In earlier versions, the stopping criteria was much more computationally
intensive to evaluate. However, in both earlier and current versions, using
(1) large numbers of simulations with (2) large numbers of samples
with (3) a large number of varying live points can make the stopping criteria
difficult to evaluate quickly. See
Nested Sampling Errors for additional details.
I’m trying to sample using gradients but getting extremely poor performance. I thought gradients were supposed to make sampling more efficient! What gives?
While gradients are extremely useful in terms of substantially improving the scaling of most sampling methods with dimensionality (gradient-based methods have better polynomial scaling than non-gradient slice sampling, both of which are substantially better over the runaway exponential scaling of random walks), it can take a while for these benefits to really kick in. These scaling arguments generally ignore the constant prefactor, which can be quite large for many gradient-based approaches that require integrating along some trajectory, often resulting in (at least) dozens of function calls per sample. This often makes it more efficient to run simpler sampling techniques on lower-dimensional problems. In general, Nested Sampling methods are also unable to exploit gradient-based information to the same degree as Hamiltonian Monte Carlo approaches, which further degrades performance and scaling relative to what you might naively expect.
If you feel like your performance is poorer than expected even given these
caveats, or if you notice other results that make you highly suspicious of the
resulting samples, please double-check the Sampling with Gradients
page to make sure you’ve passed in the correct log-likelihood gradient and are
dealing with the unit cube Jacobian properly. Failing
to apply this (or applying it twice) violates conservation of energy and
momentum and leads to the integration timesteps along the trajectories
changing in undesirable ways.
It’s also possible the numerical errors in the Jacobian (if you’ve set
compute_jac=True
) might be propagating through to the computed trajectories.
If so, consider trying to compute the analytic Jacobian by hand to reduce
the impact of numerical errors.
If you still find subpar performance, please feel free to open an issue.
Live Point Questions
How many live points should I use?
Short answer: it depends.
Longer answer: Unfortunately, there’s no easy answer here. Increasing the number of live points helps establish more flexible and robust bounds, improving the overall sampling efficiency and prior volume resolution. However, it simultaneously increases the runtime. These competing behaviors mean that compromises need to be made which are problem-dependent.
In general, for ellipsoid-based bounds an absolute minimum of ndim + 1
live points is “required”, with 2 * ndim
being a (roughly) “safe” threshold.
If bootstraps are used to establish bounds while sampling uniformly, however,
many (many) more live points should be used.
Around 50 * ndim
points are recommended for each expected mode.
Methods that do not depend on the absolute size of the bounds (but instead rely
on their shape) can use fewer live points. Their main restriction is
that new live point proposals (which “evolve” a copy of an existing live point
to a new position) must be independent of their starting point. Using too
few points can require excessive thinning, which quickly negates
the benefit of using fewer points if speed is an issue.
10 * ndim
per mode seems to work reasonably well, although
this depends sensitively on the amount of prior volume that has to be
traversed: if the likelihood is a set of tiny islands in an ocean of
prior volume, then you’ll need to use more live points to avoid missing them.
See LogGamma, Eggbox, or Exponential Wave for
some examples of this in practice.
Bounding Questions
What bounds should I be using?
Generally, 'multi'
(multiple ellipsoid decomposition) is the most
adaptive, being able to model a wide variety of behaviors and complex
distributions. It is enabled in dynesty
by default.
For simple unimodal problems, 'single'
(a single bounding ellipsoid)
can often do quite well. It also helps to guard against cases where
methods like 'multi'
can accidentally “shred” the posterior into many pieces
if the ellipsoid decompositions are too aggressive.
For low-dimensional problems, ensemble methods like 'balls'
and 'cubes'
can be quite effective by allowing live points themselves
to create “emergent” structure. These can create more flexible shapes than
'multi'
, although they have trouble modeling separate structures with
wildly different shapes.
In almost all cases, using no bound ('none'
) should be seen as a fallback
option. It is mostly useful for systematics checks or in cases where the
number of live points is small relative to the number of dimensions.
What are the differences between 'multi'
and MultiNest, nestle, etc.?
The multi-ellipsoid decomposition/bounding method implemented in dynesty
is entirely based on the algorithm implemented in nestle which itself is based on the algorithm
described in Feroz, Hobson & Bridges (2009). As such, it doesn’t include any
improvements, changes, etc. that may or may not be included in
MultiNest.
Specifically, it uses a simple scheme based on iterative k-means
clustering than some of the more robust methods based on agglomerative
clustering
implemented by some other codes such as
UltraNest.
In addition, there are a few differences in the portion of the algorithm that
decides when to split an ellipsoid into multiple ellipsoids. As with
nestle
, the implementation in dynesty
is more conservative about
splitting ellipsoids to avoid over-constraining the remaining prior volume and
also enlarges all the resulting ellipsoids by a constant volume prefactor.
It also recomputes the ellipsoids from scratch each time there is a
bounding update, rather than using ellipsoids from previous iterations.
In general this results in a slightly lower sampling efficiency but greater
overall robustness.
dynesty
also uses different heuristics than MultiNest
or MultiNest
when deciding, e.g., when to first construct bounds. By default, dynesty
waits until the efficiency hits 10% and a certain number of iterations have
passed before deciding to try split up live points into any sort of
ellipsoid decomposition. This helps to avoid problems with “shredding” the
early set of live points (which tend to be quite dispersed) into an enormous
set of ellipsoids but can substantially affect the runtime for simple problems
with tight priors. See Bounding Options for additional details as well
as the answer below.
Finally, dynesty
regularizes the ellipsoids based on their
condition number
to avoid issues involving numerical instability. This can reduce the sampling
efficiency for problems with very skewed distributions (i.e. large axis ratios)
but helps to ensure stable performance.
No matter what bounds, options, etc. I pick, the initial samples all come from `bound = 0` and continue until the overall efficiency is quite low. What’s going on here?
By default, dynesty
opts to wait until some time has passed until
constructing the first bounding distribution.
This behavior is designed to avoid constructing overly large bounds that often
significantly exceed the confines of the unit cube, which can lead to excessive
time spent generating random numbers early in a given run.
Prior to constructing the initial bound,
samples are proposed from the unit cube, which is taken to be bound = 0
.
The options that control these
heuristics can be modified using the first_update
argument.
During a run I sometimes see the bound index jump forward several places. Is this normal?
To avoid getting stuck sampling from bad bounding distributions (see above),
dynesty
automatically triggers a bounding update whenever the number of
likelihood calls exceeds update_interval
while sampling from a particular
bound. This can lead to multiple bounds being constructed before the sample
is accepted.
A constant expansion factor seems arbitrary and I want to try out bootstrapping. How many bootstrap realizations do I need?
Sec. 6.1 of Buchner (2014) discusses
the basic behavior of bootstrapping and how many iterations are needed to
ensure that realizations do not include the same live point over some number
of realizations. bootstrap = 20
appears to work well in practice, although
this is more aggressive than the bootstrap = 50
recommended by
Buchner.
When bootstrapping is on, sometimes during a run a bound will be really large. This then leads to a large number of log-likelihood calls before the bound shrinks back to a reasonable size again. Why is this happening? Is this a bug?
This isn’t (technically) a bug, but rather Monte Carlo noise associated with the bootstrapping process. Depending on the chosen method, sometimes bounds can be unstable, leading to large variations between bootstraps and subsequently large expansions factors. Some of this is explored in the Gaussian Shells and Hyper-Pyramid examples. In general, this is a sign that you don’t have enough live points to robustly determine your log-likelihood bounds at a given iteration, and should likely be running with more. Note that “robustly” is the key word here, since it can often take a (some might find “excessively”) large number of live points to confidently determine that you aren’t missing any hidden prior volume.
Pool/Parallelization Questions
My provided pool
is crashing. What do I do?
First, check that all relevant variables, functions, etc. are properly
accessible and that the pool.map
function is working as intended. Sometimes
pools can have issues passing variables to/from members or executing tasks
(a)synchronously depending on the setup.
Second, check if your pool has issues pickling some types of functions
or evaluating some of the functions in sampling
. In general,
nested functions require more advanced pickling (e.g., dill
),
which is not enabled with some pools by default.
If those quick fixes don’t work, feel free to raise an issue. However, as multi-threading and multi-processing are notoriously difficult to debug, especially on a problem I’m not familiar with, it’s likely that I might not be able to help all that much.
How to decide on the number of processes in a pool and how to set queue_size
Assuming that you decided on the number of live-points K that you want to use and that the likelihood evaluation is not very quick, you should use as many processes as you can up to around K. The queue_size should be equal the number of processes. If you are using the the number of processes that M is smaller than K, you may want to use \(M=K//2\) or \(M=K//3\) i.e integer fractions. So if you are using 1024 live-points all powers of two up to 1024 would be good choice for the number of processes.
I would like to run dynesty across multiple nodes on a cluster. How do I do that ?
The best way is to use the
schwimmbad package
and its MPIPool
. You should be able to use this pool in the same way you would use the multiprocessing.Pool
. (see schwimmbad
docs for more info). Here is a small example:
from schwimmbad import MPIPool
import numpy as np, sys, dynesty
def ptform(x):
return 10 * x - 5
def func(x):
return -0.5 * np.sum(x**2)
if __name__ == '__main__':
pool = MPIPool()
if not pool.is_master():
pool.wait()
sys.exit(0)
dns = dynesty.DynamicNestedSampler(
func, ptform, 10, pool=pool)
dns.run_nested()
When running on a cluster I run into a time limit before dynesty finishes. What should I do?
You should use the checkpointing ability of dynesty to save the state of the sampler during sampling process. Then you should be able to restart the sampling even if it was previously killed by the scheduler.
When trying to use checkpointing I’m receiving errors because my function cannot be pickled
If you receive the error like “Can’t pickle local object”, this is an error that means that python is not able to save the sampler due to the limitations of the python’s pickler. The alternative is to use another pickling module like dill
.
You can easily replace the pickling module by executing this:
import dill
import dynesty.utils
dynesty.utils.pickle_module = dill
before the checkpointing/saving code and that will force dynesty to use dill.