Homework 4
Due 2023 October 20
Report instructions
Please upload one PDF file to Gradescope (Entry code:
K326P6).
For this homework you will implement code (on PrairieLearn) for a
pseudo-random number generator (RNG) and assess the quality of a stream
of random numbers from a particular RNG. You will also study the
technique of importance sampling. Random numbers and importance sampling
are essential ingredients of Monte Carlo simulation techniques and in
the next homework, you will modify your MD code to do Monte Carlo
simulations.
Introduction
On PrairieLearn,
you are going to write (and test) a RNG that uses the linear congruent
generator (LCG) algorithm. An LCG generates a random number by taking the current random
number and performing the
following arithmetic, where is the
upper limit of possible values of , , and ; denotes the modulus. This algorithm
requires choosing a number to
start the stream, called the seed.
The quality of the LCG algorithm is very sensitive to the choice of
these parameters. For instance, your period can never be longer than
.
Gaussian Random Number
Generator
If you have a stream of uniformly distributed random numbers, it is
possible to apply a transformation to get a stream of random numbers
that obeys some other distribution.
In simulations of physical systems at thermodynamic equilibrium, it
is useful to have random numbers drawn from a Gaussian distribution.
There are some standard methods to accomplish this, and you should use
textbooks and the internet to learn about these algorithms.
After implementing a Gaussian RNG, you will verify that it works by
plotting a histogram. The correct normalization of the histogram is not
obvious. If you have random
numbers, and your histogram contains bins over a range , you should multiply by the
normalization factor .
Testing Random Number
Generators
To test a RNG that claims to produce uniformly distributed random
numbers on the interval [0,1), we can test the uniformity of the
distribution by creating a histogram of the stream of random numbers. We
create a series of evenly-spaced bins between [0,1) and we generate a
stream of random numbers, binning each value appropriately. If the RNG
stream is indeed uniformly distributed, we should observe the same
number of random numbers in each bin.
Of course, if we actually try to do this empirically, we will find
that not every bin gets the same number of counts. Instead, there will
be fluctuations that cause some bins to get more counts than others. On
the other hand, a RNG stream that is not truly uniform will also lead to
a histogram with unequal counts in each bin. We want to distinguish
between non-uniformity that occurs because of statistical fluctuations
(i.e., having only a finite sample size) and non-uniformity due to a
broken random number generator. We use the following metric to
distinguish between these two cases: where is the number of counts in bin and is the expected number of counts in
bin (the total number of points
divided by the number of bins for our case of the uniform
distribution).
If is true for all
, the statistic approaches the eponymous
chi-squared distribution with degrees of freedom. The
statement of the goodness-of-fit can be simplified to this: If this
value is greater than some
critical value, then with some quantifiable confidence this indicates
that the generator of this stream is not what it is claimed to be (i.e.,
a uniform distribution of ). We can try solving for these critical values
numerically by integrating the probability distribution function for the
chi-squared distribution, or for degrees of freedom up to 100 just use
the tabulated values provided by NIST.
For this homework, let us choose the 90% confidence level for the
hypothesis that the generator is not broken (random). Critical values of
should be taken from the
column labeled “0.10”.
One can also bin numbers in more than 1 dimension. For example, for 2
dimensions one could take pairs of numbers and bin them onto a grid.
This is useful because it will expose correlations between successive
numbers in the series. This is also a serious problem, because we
generally use a RNG stream with the assumption that successive numbers
are independent of one another.
You will write functions that compute the metric also for histograms in two
and three dimensions. Note that the test becomes increasingly
stringent in higher dimensions, and RNGs judged suitable for research
purposes are generally required to pass the test in 10 dimensions. You may
find yourself working with degrees of freedom greater than 100 (off the
NIST table mentioned above), in which case you will have to calculate
the values yourself by numerical integration or use the critical value
calculator found at Chi
Square Calculator.
For 90% confidence level, you should set the “probability” parameter
to 0.9. (Note: Do not set it to 0.10 as you did above, where it
represented the upper tail integral.).
Importance Sampling.
You will work to compute the integral
which cannot be done analytically.
First, you will estimate the value of the continuous integral by
performing a summation of the function at a discrete set of grid points:
where is the appropriate normalization
involving and .
Second, you will compute this integral using Monte Carlo with
importance sampling. Given an integral ,
previously we will have to choose a bound , and discretize the x-axis with evenly spaced bins. There are usually
two problems with this approach. Firstly, a good or is not always known beforehand.
Secondly, if the function is
known to have sharp peaks somewhere, naturally one would need more bins
around these areas, but it would be expensive to increase the density of
bins everywhere.
The idea of importance sampling is that if we want to evaluate we
can rewrite the integral as where is a probability distribution (i.e.,
positive for all x and normalized such that )
and is the
estimator.
Using statistics, one recognizes that where is the
expectation value of under the probability
distribution depicted by . To
actually calculate the expectation value, one uses the following
estimate where
random numbers are sampled according to the
probability distribution depicted by .
Essentially, one turns a integration problem into a sampling problem.
It is called importance sampling, because the variance is minimum when
one chooses a such that , in the sense that we
sample more points where is
the largest in magnitude and sample less points where is small in magnitude. In reality,
this is not practical, the next best thing one can do is to match the
shapes and the locations of the peaks of and as closely as possible.
For our integrand , we will choose where is chosen so
that is a valid probability
distribution (i.e., normalized correctly).
The utility of doing this is that we know how to sample a Gaussian
distribution using a Gaussian random number generator.
Thus, for Monte Carlo integration with importance sampling, we will
sample random configurations distributed according to the function , and we will evaluate the integral
by computing the expectation
value of the estimator : .
Your report
- Discuss and show how long it takes before LCG(16,3,1,2) produces
repeating values! Are all the numbers between 0 and 15 represented?
- To show that your Gaussian RNG produces the correct distribution,
generate a stream of 10,000 numbers with mean 0 and standard deviation 1
and histogram the values. If normalized correctly, your histogram should
match the function .
Produce a plot of your histogram and compare with the Gaussian
function.
- Generate a stream of random numbers on the interval [0,1) using your
LCG function and report the chi-squared values in 1, 2, and 3
dimensions. Perform the same tests on the random.random() module that
comes with Python. For each stream, state explicitly if the RNG passes
the tests. Plot the numbers
of your PrairieLearn problems
4.8, 4.9, and 4.10 binned in 2 dimensions (for 4.8,
make the streamed values into pairs; for 4.10, just use two of the three
members in the triplet). Use the first 1000 points of the supplied data
sets for plotting.
- For your numerical integration, choose finite limits of integration
that you believe are a good
approximation to the infinite limits and justify why. Also find the
correct expression for , the
normalization.
- Report the result of the numerical integration with your choice of
parameters when using the rectangle rule.
- Determine an expression for
that normalizes .
- Using the functional forms of and given above, write down an analytic
expression for the estimator .
Also write down an expression for the estimator of the variance of computed with Monte Carlo integration
sampling the distribution .
- Compute the value of the integral with Monte Carlo integration using
your estimator for the mean and your Gaussian distribution of random
numbers. Also report the variance of your answer using the appropriate
variance estimator. Notice, that for different alpha’s you are using
different importance functions and hence will get a different variance.
Graph the variance versus alpha and submit this with your code.
- You should optimize your importance sampling by finding the
parameter that minimizes the
variance. Choose at least 8 different values of between 0 and 2 and report your
estimate of the integral and the variance for each. Submit a plot of the
variance as a function of .
Based on this plot, report your optimal choice of .
- If the variance were infinite for certain values of , we would not know it from the
results above. As a first test, run your simulation again with 4 times
as many steps. If your variance is well defined, then we know that the
variance should be effectively independent of how long you run it. If
this turns out not to be the case, then you should worry that you have
infinite variance! Submit a graph that graphs alpha versus (the ratio
your calculated variance with 4 times as many steps to the original
calculated variance). Also indicate where you have infinite
variance.
- Calculate the variance analytically, expressing your answer as a
function of . State the range
of values of alpha for which the variance is infinite.
- Now that you have located values of alpha where the variance is
infinite, we will look to see how this manifests itself in the numerical
data. Make a trace plot of versus sample number for a
value of that has infinite
variance and a value of that
has finite variance. Describe the qualitative difference between these
two situations.