NRT Lectures - Statistical Modeling

Bayesian Hierarchical Model

Rat Tumor Example

In [55]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
from pymc3 import (Model, sample, Beta, Binomial, Exponential, Uniform,
                   summary, plot_posterior, model_to_graphviz, Deterministic)
In [33]:
d = pd.read_table("rattumor.txt", sep=" ")
d = d.iloc[:, :2]  # keep only the first two columns: y (tumor count) and N (group size)
d
Out[33]:
     y   N
0    0  20
1    0  20
2    0  20
3    0  20
4    0  20
..  ..  ..
66  16  52
67  15  46
68  15  47
69   9  24
70   4  14

71 rows × 2 columns

In [4]:
d.describe()
Out[4]:
               y          N
count  71.000000  71.000000
mean    3.760563  24.492958
std     3.811504  10.973830
min     0.000000  10.000000
25%     1.000000  19.000000
50%     3.000000  20.000000
75%     5.000000  22.500000
max    16.000000  52.000000

A naive per-experiment estimate of $\theta_j$ is $\hat{\theta}_j = y_j/n_j$.

Histogram of the $\hat{\theta}_j$:

In [54]:
plt.hist(d.y/d.N, range=(0, 1), bins=18, density=True)
plt.title("Histogram of naive estimates")
plt.show()
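For reference, the pooled rate $\sum_j y_j / \sum_j n_j$ can be read off the `describe()` output above: since all 71 rows are weighted equally, it equals mean(y)/mean(N). A quick sketch (values copied from the summary table):

```python
# Pooled tumor rate: with 71 equally weighted rows, mean(y)/mean(N)
# equals sum(y)/sum(N). Values copied from d.describe() above.
mean_y, mean_N = 3.760563, 24.492958
pooled_rate = mean_y / mean_N
print(round(pooled_rate, 4))  # ~0.1535
```

The individual rates scatter widely around this pooled value, which is what motivates a hierarchical model rather than complete pooling.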
In [58]:
with Model() as model1:

    # Weakly informative hyperpriors on the Beta hyperparameters
    alpha = Exponential('alpha', 0.001)
    beta = Exponential('beta', 0.001)

    # One tumor probability per experiment (71 groups)
    theta = Beta('theta', alpha=alpha, beta=beta, shape=71)

    # Data likelihood
    y_like = Binomial('y_like', n=d.N, p=theta, observed=d.y)
In [59]:
# random.seed only seeds Python's stdlib RNG; pass random_seed to sample()
# so the PyMC3 sampler itself is reproducible
with model1:
    trace1 = sample(100, tune=100, random_seed=100)
Only 100 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [theta, beta, alpha]
Sampling 2 chains, 0 divergences: 100%|███████████████████████████████████████████| 400/400 [21:23<00:00,  3.21s/draws]
The acceptance probability does not match the target. It is 0.9249541348188928, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9281903582547143, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The number of effective samples is smaller than 10% for some parameters.
In [60]:
summary(trace1)
Out[60]:
             mean      sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
alpha       4.201   2.281   1.313    8.401      0.693    0.504      11.0    11.0      13.0      29.0   1.14
beta       24.835  13.556   9.254   49.770      4.308    3.141      10.0    10.0      12.0      31.0   1.15
theta[0]    0.080   0.041   0.005    0.146      0.003    0.002     211.0   211.0     179.0     122.0   1.02
theta[1]    0.080   0.038   0.014    0.144      0.004    0.003      85.0    85.0      94.0     187.0   1.02
theta[2]    0.081   0.043   0.003    0.151      0.006    0.004      48.0    48.0      42.0      95.0   1.05
...           ...     ...     ...      ...        ...      ...       ...     ...       ...       ...    ...
theta[66]   0.246   0.048   0.171    0.338      0.004    0.003     149.0   149.0     153.0     129.0   1.00
theta[67]   0.262   0.052   0.158    0.353      0.007    0.005      55.0    55.0      54.0      74.0   1.02
theta[68]   0.252   0.053   0.152    0.354      0.006    0.004      89.0    89.0      89.0      83.0   1.03
theta[69]   0.257   0.070   0.131    0.392      0.007    0.005      89.0    89.0      86.0     106.0   1.02
theta[70]   0.200   0.071   0.086    0.336      0.005    0.004     173.0   147.0     163.0      95.0   1.00

73 rows × 11 columns
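A quick sanity check on these estimates: conditional on fixed $(\alpha, \beta)$, the posterior mean of each $\theta_j$ is the Beta-Binomial shrinkage estimate $(y_j+\alpha)/(n_j+\alpha+\beta)$. Plugging in the posterior means of alpha and beta from the summary above roughly reproduces the reported theta means. This is a heuristic check, not an exact identity, since it ignores posterior uncertainty in the hyperparameters:

```python
# Beta-Binomial shrinkage: posterior mean of theta_j given fixed (alpha, beta)
alpha_hat, beta_hat = 4.201, 24.835   # posterior means from summary(trace1)

def shrunk_mean(y, n, a=alpha_hat, b=beta_hat):
    return (y + a) / (n + a + b)

# Group 0 (y=0, N=20): summary reports theta[0] mean ~ 0.080
print(round(shrunk_mean(0, 20), 3))   # ~0.086
# Group 66 (y=16, N=52): summary reports theta[66] mean ~ 0.246
print(round(shrunk_mean(16, 52), 3))  # ~0.249
```

Both values sit between the naive estimate $y_j/n_j$ and the prior mean $\alpha/(\alpha+\beta)$, which is exactly the partial pooling the hierarchical model provides.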

In [72]:
plot_posterior(trace1['alpha'])
Out[72]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000028124DB8E88>],
      dtype=object)
In [64]:
plot_posterior(trace1['beta'])
Out[64]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000281205BAF08>],
      dtype=object)
In [90]:
alpha = trace1.get_values(varname='alpha')
beta = trace1.get_values(varname='beta')
plt.scatter(alpha, beta)
plt.xlabel('alpha')
plt.ylabel('beta')
plt.show()
In [91]:
plt.scatter(np.log(alpha/beta), np.log(alpha+beta))
plt.xlabel('log(alpha/beta)')
plt.ylabel('log(alpha+beta)')
plt.show()
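The axes in this second scatter plot are a standard reparameterization of the hyperparameters: $x = \log(\alpha/\beta)$ is the logit of the prior mean, and $y = \log(\alpha+\beta)$ is the log of the prior "sample size". The map is invertible; a small sketch of the round trip, using illustrative values only:

```python
import math

# Forward map used for the scatter plot axes
def to_xy(a, b):
    return math.log(a / b), math.log(a + b)

# Inverse map: recover (alpha, beta) from the plotted coordinates
def from_xy(x, y):
    total = math.exp(y)                      # alpha + beta
    frac = math.exp(x) / (1 + math.exp(x))   # alpha / (alpha + beta)
    return frac * total, (1 - frac) * total

x, y = to_xy(4.0, 24.0)   # illustrative values near the posterior means
a, b = from_xy(x, y)
print(round(a, 6), round(b, 6))  # round trip recovers (4.0, 24.0)
```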

Try another prior

In [67]:
with Model() as model2:

    # Reparameterized hyperprior: phi1 = alpha/(alpha+beta) is the prior
    # mean of theta, and phi2 = 1/sqrt(alpha+beta) controls its dispersion
    phi1 = Uniform('phi1', lower=0, upper=1)
    phi2 = Uniform('phi2', lower=0, upper=1000)

    alpha = Deterministic('alpha', phi1 / phi2**2)
    beta = Deterministic('beta', (1 - phi1) / phi2**2)

    theta = Beta('theta', alpha=alpha, beta=beta, shape=71)

    # Data likelihood
    y_like = Binomial('y_like', n=d.N, p=theta, observed=d.y)
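model2's parameterization can be inverted by hand: from $\alpha = \phi_1/\phi_2^2$ and $\beta = (1-\phi_1)/\phi_2^2$ it follows that $\phi_1 = \alpha/(\alpha+\beta)$ (the prior mean of $\theta$) and $\phi_2 = 1/\sqrt{\alpha+\beta}$ (large when the prior is diffuse). A small consistency check with illustrative numbers:

```python
import math

# model2's hyperprior parameterization, and its inverse
def phi_to_ab(phi1, phi2):
    return phi1 / phi2**2, (1 - phi1) / phi2**2

def ab_to_phi(a, b):
    return a / (a + b), 1 / math.sqrt(a + b)

a, b = phi_to_ab(0.145, 0.270)   # illustrative values
phi1, phi2 = ab_to_phi(a, b)
print(round(phi1, 3), round(phi2, 3))  # round trip recovers the inputs
```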
In [68]:
# As above, pass random_seed to sample() for reproducibility
with model2:
    trace2 = sample(100, tune=100, random_seed=100)
Only 100 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [theta, phi2, phi1]
Sampling 2 chains, 0 divergences: 100%|███████████████████████████████████████████| 400/400 [55:15<00:00,  8.29s/draws]
The acceptance probability does not match the target. It is 0.9498031914528816, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9394815263249752, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The number of effective samples is smaller than 25% for some parameters.
In [69]:
summary(trace2)
Out[69]:
             mean      sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
phi1        0.145   0.015   0.117    0.171      0.001    0.001     187.0   187.0     197.0     181.0   1.00
phi2        0.270   0.042   0.197    0.358      0.009    0.006      23.0    23.0      22.0      80.0   1.10
alpha       2.132   0.717   1.073    3.397      0.147    0.105      24.0    24.0      25.0      97.0   1.09
beta       12.704   4.269   5.809   19.931      0.921    0.660      21.0    21.0      22.0      80.0   1.10
theta[0]    0.054   0.037   0.003    0.119      0.002    0.002     271.0   254.0     214.0     130.0   1.01
...           ...     ...     ...      ...        ...      ...       ...     ...       ...       ...    ...
theta[66]   0.270   0.050   0.180    0.361      0.002    0.002     460.0   460.0     460.0     149.0   1.01
theta[67]   0.282   0.067   0.166    0.422      0.003    0.003     460.0   355.0     460.0     155.0   1.00
theta[68]   0.286   0.053   0.187    0.378      0.003    0.002     335.0   335.0     365.0     144.0   1.00
theta[69]   0.292   0.078   0.148    0.446      0.004    0.003     455.0   411.0     440.0     182.0   1.01
theta[70]   0.210   0.072   0.080    0.334      0.005    0.004     224.0   175.0     268.0      59.0   1.03

75 rows × 11 columns
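Despite the different hyperpriors, the two fits imply a very similar prior mean for $\theta$: model1's posterior means give $\alpha/(\alpha+\beta) \approx 4.201/29.036 \approx 0.145$, matching model2's posterior mean for phi1 directly. A quick check, with the numbers copied from the two summaries above:

```python
alpha1, beta1 = 4.201, 24.835   # posterior means from summary(trace1)
phi1_model2 = 0.145             # posterior mean of phi1 from summary(trace2)

implied_mean_model1 = alpha1 / (alpha1 + beta1)
print(round(implied_mean_model1, 3))  # ~0.145, matching model2's phi1
```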

In [74]:
plot_posterior(trace2['alpha'])
Out[74]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000028119CECC48>],
      dtype=object)
In [75]:
plot_posterior(trace2['beta'])
Out[75]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002811B7B35C8>],
      dtype=object)
In [83]:
alpha2 = trace2.get_values(varname='alpha')
beta2 = trace2.get_values(varname='beta')
plt.scatter(np.log(alpha2/beta2), np.log(alpha2+beta2))
plt.xlabel('log(alpha/beta)')
plt.ylabel('log(alpha+beta)')
plt.show()