Homework 1
Due 2023 September 8
Report instructions
Please upload one PDF file to Gradescope (Entry code:
K326P6).
For this homework you will implement code (on PrairieLearn) to
perform statistical analysis of data, which is representative of
analyzing simulation output. You specifically should implement functions
that compute the mean value, the standard deviation, the
auto-correlation time, and the standard error (please see PrairieLearn for
details and instructions). For your report you will be asked to discuss
a couple of specific problems related to your analysis.
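As a starting point, the four quantities named above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the required solution: the function names are hypothetical, the standard deviation uses the N-1 normalization, and the autocorrelation time is taken to be 1 + 2 Σ ρ(t), with the sum truncated at the first non-positive ρ(t). If the PrairieLearn instructions define any of these differently, follow those definitions.

```python
import numpy as np


def mean_value(data):
    """Arithmetic mean of the data."""
    return float(np.mean(data))


def std_dev(data):
    """Sample standard deviation (N-1 normalization, an assumption)."""
    return float(np.std(data, ddof=1))


def autocorr_time(data):
    """Autocorrelation time kappa = 1 + 2 * sum_t rho(t).

    Assumption: the sum is cut off at the first lag t where the
    normalized autocorrelation rho(t) is no longer positive.
    """
    x = np.asarray(data, dtype=float)
    n = x.size
    dx = x - x.mean()
    var = np.dot(dx, dx) / n
    if var == 0.0:
        return 1.0  # constant data: no fluctuations to correlate
    kappa = 1.0
    for t in range(1, n):
        rho = np.dot(dx[:-t], dx[t:]) / ((n - t) * var)
        if rho <= 0.0:
            break
        kappa += 2.0 * rho
    return float(kappa)


def std_error(data):
    """Error of the mean, corrected for correlation: sigma * sqrt(kappa / N)."""
    n = len(data)
    return float(std_dev(data) * np.sqrt(autocorr_time(data) / n))
```

Setting the autocorrelation time to 1 in the last function gives the uncorrelated estimate sigma / sqrt(N), which is useful for comparing the error with and without correlation.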
Data set 1
- Discuss the change in the estimated error of the mean, when you use
only the first 500 data points to calculate the mean value.
- Discuss the change in the standard deviation, when you only use the
first 500 data points.
- How would you expect these quantities to change as you gradually
remove more and more data points from the data set (until only 500 are
left)? Why?
Data set 2
- Discuss the change in the error of the estimated mean when you
include/exclude correlation.
Data set 3
- Discuss where you would set the initial cutoff for this data set and
why.
- Discuss whether the difference between the two mean values (with and
without cutoff) is significant and why.
Data set 4
A data set (download here) was sampled
from the distribution with and . The constant is determined by the normalization.
- Based on the analytic expression given above, what do you expect for
the mean and why?
- What do you expect for the variance? (Hint: For the variance,
the behavior at large values of the sampled variable matters, so
one can use an approximation for the denominator of the distribution
function.)
- Look at the convergence of the mean and by computing these values for five
“end cutoffs” from 1000 to 5000 (i.e., use data points 0–999, 0–1999,
0–2999, etc.). Do the same for the data sets from the problems “Dataset 1:
Mean value” and “Dataset 2: Autocorrelation time” on PrairieLearn and
compare the convergence behavior.
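The end-cutoff procedure described above can be sketched as follows. This is only an illustration: the helper name `convergence_table` is hypothetical, the second quantity is assumed here to be the sample standard deviation, and the demo data are random placeholders rather than any of the homework data sets.

```python
import numpy as np


def convergence_table(data, cutoffs=(1000, 2000, 3000, 4000, 5000)):
    """Return (end, mean, std) using only data points 0 .. end-1 for each cutoff."""
    x = np.asarray(data, dtype=float)
    rows = []
    for end in cutoffs:
        chunk = x[:end]  # data points 0 .. end-1
        rows.append((end, float(chunk.mean()), float(chunk.std(ddof=1))))
    return rows


if __name__ == "__main__":
    # Placeholder data; replace with the data set loaded from file.
    rng = np.random.default_rng(0)
    demo = rng.normal(size=5000)
    for end, m, s in convergence_table(demo):
        print(f"{end:5d}  mean={m:+.4f}  std={s:.4f}")
```

Printing or plotting the rows side by side for each data set makes it easier to see which estimates settle down as more points are included and which do not.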
Central Limit Theorem
Given a population with a mean $\mu$ and a finite, non-zero variance $\sigma^2$, the sampling distribution of the
mean approaches a normal distribution with a mean of $\mu$ and a variance of $\sigma^2/N$ as $N$, the sample size, increases.
This is the Central Limit Theorem and implies that the estimated mean
approaches a Gaussian distribution as more points are used.
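A quick numerical check can make this statement concrete. The sketch below is an illustration under assumptions, not part of the assignment: it draws repeated samples from an exponential distribution (chosen arbitrarily because it is clearly non-Gaussian, with mean 1 and variance 1) and measures the variance of the sample means. The function name and parameters are hypothetical.

```python
import numpy as np


def sample_mean_statistics(n, trials=5000, seed=0):
    """Draw `trials` independent samples of size n and return
    (mean of the sample means, variance of the sample means)."""
    rng = np.random.default_rng(seed)
    # Exponential with scale 1: population mean 1, population variance 1.
    samples = rng.exponential(scale=1.0, size=(trials, n))
    means = samples.mean(axis=1)
    return float(means.mean()), float(means.var(ddof=1))


if __name__ == "__main__":
    for n in (5, 50, 500):
        m, v = sample_mean_statistics(n)
        # The theorem predicts a variance of 1/n for the sample mean.
        print(f"N={n:4d}  mean={m:.4f}  var={v:.5f}  predicted var={1.0 / n:.5f}")
```

A histogram of the sample means for increasing sample size would show the shape approaching a Gaussian as well.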
Now suppose you have two versions of a code that you’re debugging. You
run each code 6 times to try to determine if they give the same answers.
This yields:
Compute the mean, variance, and the estimate of the error of the
mean for each version separately, assuming each run is
uncorrelated with the others, and report these results.
Show that the probability that the two runs are (NOT) drawn from
the same distribution is about 29% (71%).
To do this, first find how many standard deviations the difference is
from zero; do this by dividing the “estimate of the difference” by the
“estimate of the error of the difference.” From this number determine
the probability that the two are from the same distribution using a
standard normal probability distribution table (or, equivalently, the
error function). A detailed
explanation of how to do this can be found here/PDF.
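The procedure described above can be sketched in Python with the complementary error function from the standard library in place of a printed table. This is a sketch under assumptions: the function names are hypothetical, the variance uses the N-1 normalization, the error of the difference is taken as the two errors of the mean added in quadrature (appropriate for independent runs), and the probability is the two-sided tail of the standard normal distribution.

```python
import math


def mean_var_error(values):
    """Return (mean, sample variance, error of the mean) for uncorrelated runs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # N-1 is an assumption
    err = math.sqrt(var / n)
    return mean, var, err


def same_distribution_probability(a, b):
    """Return (z, p_same) for two sets of independent runs.

    z is the difference of the means in units of its estimated error;
    p_same is the two-sided normal tail probability erfc(z / sqrt(2)).
    """
    mean_a, _, err_a = mean_var_error(a)
    mean_b, _, err_b = mean_var_error(b)
    diff = abs(mean_a - mean_b)
    err_diff = math.sqrt(err_a ** 2 + err_b ** 2)  # errors added in quadrature
    z = diff / err_diff
    p_same = math.erfc(z / math.sqrt(2.0))
    return z, p_same


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; NOT the homework data.
    runs_1 = [1.02, 0.98, 1.05, 0.97, 1.01, 1.03]
    runs_2 = [1.04, 1.00, 1.06, 0.99, 1.05, 1.02]
    z, p = same_distribution_probability(runs_1, runs_2)
    print(f"z = {z:.3f}, P(same) = {p:.3f}, P(not same) = {1 - p:.3f}")
```

A table lookup and the error function give the same number, since the two-sided tail probability for z standard deviations equals erfc(z / √2).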