Bayesian Theory
Bayesian Inference: Priors, Posteriors, and Predictive Analysis
Bayesian analysis has several advantages over classical analysis in the context of clinical trials. These include the ability to incorporate prior information about treatment efficacies into the analysis, the flexibility to make multiple unscheduled inspections of accumulating data without increasing the error rate, and the capability to calculate the probability that one treatment is more effective than another.
In contrast to classical methods, Bayesian analysis is conditional on the observed data and focuses on the probability that a conclusion or hypothesis is true given the available data. Classical inference, however, is not conditional on the observed data but instead concerns the behavior of a statistical procedure over an infinite number of repetitions, considering all potential data that might have been observed under a hypothesis. Bayesians deal with the probabilities of hypotheses given a dataset, whereas frequentists concern themselves with the probabilities of datasets given a hypothesis.
1. Overview of Bayesian and Classical Analysis
2. Methodological Contrasts
3. Limitations in Classical Hypothesis Testing
4. Advantages of Bayesian Methodology
5. Implementation and Impact
Bayes’ theorem can be expressed mathematically in the context of Bayesian inference as:
\[ \overbrace{p(\theta \mid D)}^{\text{Posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(D)}_{\text{Evidence}}} \]
This equation is fundamental to Bayesian analysis and can be broken down into three critical components:
The Prior Distribution in Bayesian statistics is a fundamental component that encapsulates our knowledge or beliefs about a parameter before we observe any data. It’s an expression of our subjective or objective preconceptions about the values that a parameter, typically denoted as \(\theta\), might take based on previous experience, existing knowledge, or expert opinion.
Characteristics of Prior Distributions
Role of Prior Distribution in Bayesian Analysis
Bayesian inference updates the prior probability distribution based on newly acquired data. According to Bayes’ theorem, the posterior distribution \(p(\theta| x)\) is proportional to the product of the likelihood \(p(x | \theta)\) and the prior distribution \(p(\theta)\), expressed as:
\[ p(\theta|x) \propto p(x|\theta)p(\theta) \]
The integral of the likelihood function over the prior distribution, normalized by the probability of the data \(p(x)\), provides the marginal likelihood:
\[ p(x) = \int p(x|\theta)p(\theta) d\theta \]
This marginal likelihood \(p(x)\) serves as the normalizing constant for the posterior distribution:
\[ p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)} \]
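As a quick illustration, this normalization can be carried out numerically on a grid. The sketch below assumes a hypothetical Beta(2, 2) prior on a success probability and 7 successes observed in 10 Bernoulli trials; conjugacy tells us the exact answer it should recover:

```python
import numpy as np

# Grid approximation of p(theta | x) ∝ p(x | theta) p(theta), normalized by
# the evidence p(x). Hypothetical example: Beta(2, 2) prior on a success
# probability theta, with y = 7 successes in n = 10 Bernoulli trials.
theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]                      # grid spacing
prior = theta * (1 - theta)                  # Beta(2, 2) kernel
likelihood = theta**7 * (1 - theta)**3       # binomial kernel, y = 7, n = 10
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * d)      # divide by the evidence p(x)

# Conjugacy gives the exact posterior Beta(2 + 7, 2 + 3) = Beta(9, 5),
# whose mean is 9 / 14.
post_mean = (theta * posterior).sum() * d
```

The grid-based posterior mean should agree with the conjugate answer 9/14 to within the discretization error.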
Bayesian inference effectively incorporates nuisance parameters by integrating them out of the joint posterior to provide a marginal posterior distribution \(p(\theta_1 | x)\), which is calculated by integrating over the nuisance parameter \(\theta_2\):
\[ p(\theta_1 \mid x) = \int p(\theta_1, \theta_2 \mid x) \, d\theta_2 \]
The predictive distribution in Bayesian inference uses the posterior distribution to make predictions about future data. This approach is inherently probabilistic, reflecting the uncertainty inherent in the data and the model parameters. Key elements of Bayesian prediction include:
In Bayesian inference, we often deal with two mutually exclusive and exhaustive hypotheses:
Let:
Bayes’ Theorem allows us to update our beliefs after observing data \(y\). The posterior probability for \(H_0\) is:
\[ p(H_0 \mid y) = \frac{p(y \mid H_0) \cdot p(H_0)}{p(y)} \]
Where the denominator \(p(y)\) is the marginal likelihood or evidence:
\[ p(y) = p(y \mid H_0) \cdot p(H_0) + p(y \mid H_1) \cdot p(H_1) \]
Since \(H_1 = \text{not } H_0\), we also have \(p(H_1) = 1 - p(H_0)\).
Bayes’ Theorem can also be written in terms of odds:
\[ \frac{p(H_0 \mid y)}{p(H_1 \mid y)} = \frac{p(y \mid H_0)}{p(y \mid H_1)} \cdot \frac{p(H_0)}{p(H_1)} \]
In this expression:
This version shows that:
\[ \text{Posterior odds} = \text{Likelihood ratio} \times \text{Prior odds} \]
Taking the logarithm of both sides:
\[ \log(\text{Posterior odds}) = \log(\text{Likelihood ratio}) + \log(\text{Prior odds}) \]
The term \(\log(\text{Likelihood ratio})\) is known as the weight of evidence, a concept first introduced by Alan Turing during World War II in his work on decoding the Enigma machine.
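The additive form above is easy to sketch in code. The prior probability and likelihood ratio below are hypothetical illustrations:

```python
import math

def update_log_odds(prior_prob_h0, likelihood_ratio):
    """Posterior probability of H0 via log-odds: add the weight of evidence
    log(LR) to the prior log-odds, then convert back to a probability."""
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    log_posterior_odds = math.log(likelihood_ratio) + math.log(prior_odds)
    posterior_odds = math.exp(log_posterior_odds)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical numbers: an even prior (0.5) and LR = 10 in favor of H0
# give posterior odds of 10, i.e. a posterior probability of 10/11.
p = update_log_odds(0.5, 10)
```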
A typical graph of Bayes’ Theorem under different likelihood ratios illustrates how prior and posterior probabilities relate:
Each curve represents a different likelihood ratio (e.g., 1, 5, 10, 20, 50).
Interpretation examples:
The result of Bayesian inference depends not only on the strength of the evidence (likelihood ratio) but also heavily on the initial belief (prior probability).
In particular:
In Bayesian hypothesis testing, the likelihood of two competing hypotheses \(H_0\) and \(H_1\) (for example, \(H_0: \theta = 0\) and \(H_1: \theta = 1\)) is evaluated based on the data, taking into account prior beliefs about these hypotheses.
Bayes Factors: the Bayes factor is a crucial metric in Bayesian hypothesis testing. It provides a quantifiable measure of support for one hypothesis over another, based on the observed data.
The Bayes Factor is essentially the same as the likelihood ratio when comparing two simple hypotheses. It quantifies the evidence provided by the data in favor of one hypothesis over another:
\[ \text{Bayes Factor (BF)} = \frac{p(y \mid H_0)}{p(y \mid H_1)} \]
Interpretation:
This is also referred to by Cornfield as the “relative betting odds” and is central to Bayesian inference.
Bayes factors are used to transform prior odds into posterior odds:
\[ \frac{p(H_0 \mid y)}{p(H_1 \mid y)} = \text{Bayes Factor} \times \frac{p(H_0)}{p(H_1)} \]
This shows how evidence from data (through the BF) updates our prior beliefs to form our posterior beliefs.
Interpreting the Bayes Factor: Jeffreys’ Scale
The strength of evidence indicated by the Bayes factor can be interpreted using Harold Jeffreys’ scale (see Table 3.2):
| Bayes Factor (BF) | Evidence Strength (in favor of \(H_0\)) |
|---|---|
| > 100 | Decisive |
| 32 to 100 | Very strong |
| 10 to 32 | Strong |
| 3.2 to 10 | Substantial |
| 1 to 3.2 | Barely worth mentioning |

| Bayes Factor (BF) | Evidence Strength (in favor of \(H_1\)) |
|---|---|
| 1 to 1/3.2 | Barely worth mentioning |
| 1/3.2 to 1/10 | Substantial |
| 1/10 to 1/32 | Strong |
| 1/32 to 1/100 | Very strong |
| < 1/100 | Decisive |
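A small helper (a sketch, not part of any standard library) can map a Bayes factor onto Jeffreys’ scale:

```python
def jeffreys_label(bf):
    """Map a Bayes factor to Jeffreys' descriptive scale.
    BF > 1 favours H0; BF < 1 favours H1 (reciprocal strength)."""
    favoured = "H0" if bf >= 1 else "H1"
    r = bf if bf >= 1 else 1 / bf          # strength on the >= 1 side
    if r > 100:
        strength = "Decisive"
    elif r > 32:
        strength = "Very strong"
    elif r > 10:
        strength = "Strong"
    elif r > 3.2:
        strength = "Substantial"
    else:
        strength = "Barely worth mentioning"
    return strength, favoured
```

For example, a BF of 50 is “Very strong” evidence for \(H_0\), while a BF of 1/15 is “Strong” evidence for \(H_1\).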
P-values and Bayes factors are two approaches for hypothesis testing and assessing evidence, but they represent fundamentally different statistical paradigms. Here’s a comparison and detailed explanation:
P-Values (Frequentist Paradigm)
A P-value is a frequentist measure used to test a null hypothesis (\(H_0\)):
Bayes Factors (Bayesian Paradigm)
A Bayes factor (BF) is a Bayesian measure of evidence that compares two hypotheses, \(H_0\) (null) and \(H_1\) (alternative):
Definition:
Purpose:
Interpretation:
Bayes Factor Scale (Jeffreys’ interpretation):
| Bayes Factor (BF) | Evidence Strength |
|---|---|
| \(1\) | No evidence |
| \(1 - 3\) | Weak evidence for \(H_1\) |
| \(3 - 10\) | Moderate evidence for \(H_1\) |
| \(> 10\) | Strong evidence for \(H_1\) |
Advantages:
Bayes Factor Example
Comparison: P-Values vs. Bayes Factors
| Aspect | P-Values | Bayes Factors |
|---|---|---|
| Paradigm | Frequentist | Bayesian |
| Key Concept | Probability of data given \(H_0\). | Likelihood ratio of data under \(H_1\) and \(H_0\). |
| Prior Knowledge | Not considered. | Incorporates prior beliefs. |
| Output | Single probability value. | A ratio quantifying evidence. |
| Hypotheses Comparison | Only tests against \(H_0\). | Direct comparison of \(H_0\) and \(H_1\). |
| Effect of Sample Size | Large sample size can lead to small \(P\)-values even for trivial effects. | Directly incorporates data strength. |
| Interpretation | Often misinterpreted as evidence for \(H_1\). | Provides explicit evidence for one hypothesis over another. |
To characterize a dataset’s probability distribution accurately, two main components are essential:
Form of the Probability Density Function (PDF):
Parameters of the PDF:
In statistical modeling, estimating unknown parameters is a common task:
Bayesian models are complex because they consider a distribution of parameter values rather than a single point estimate. This complexity often necessitates sophisticated computational techniques:
As more data is observed, Bayesian models provide clearer inference about the distribution of parameters, known as posterior inference. This approach contrasts with other predictive strategies:
Maximum Likelihood Estimation (MLE):
Maximum a Posteriori (MAP):
Bayesian Model:
Comparison and Applications
1. Concept of MLE:
2. Assumptions in MLE:
3. Formulation:
4. Properties of MLE:
5. Limitations:
MAP Concept and Methodology:
Differences:
Bayesian modeling distinguishes itself by considering the entire distribution of the data \(X\) and the parameters \(\theta\) instead of just optimizing parameter values based on a given sample set \(X\). This approach naturally helps prevent overfitting by integrating over all possible parameter values rather than selecting a single optimal set.
Advantages of Bayesian Modeling
Making predictions is a central goal of statistical modeling. In the Bayesian framework, predictions are straightforward and intuitively built upon the prior and posterior distributions.
Suppose you have observed some data \(y\) and wish to predict future observations \(x\). The predictive distribution is:
\[ p(x \mid y) = \int p(x \mid \theta) \cdot p(\theta \mid y) \, d\theta \]
This is a weighted average of the sampling distribution \(p(x \mid \theta)\) over the posterior distribution of the parameter \(\theta\). It accounts for both:
If \(x\) and \(y\) are conditionally independent given \(\theta\), then:
\[ p(x \mid y) = \int p(x \mid \theta) \cdot p(\theta \mid y) \, d\theta \]
This integration provides a posterior predictive distribution, useful for:
Suppose:
Then:
The expected number of successes is:
\[ E(Y_n) = n \cdot \mu \]
For a single future trial, the predicted probability of success is simply the current mean:
\[ P(Y = 1) = E(\theta) \]
If your current distribution for \(\theta\) is Beta[a, b], the predictive distribution for the number of future successes \(Y_n\) follows a beta-binomial distribution:
\[ p(y_n) = \binom{n}{y_n} \cdot \frac{B(a + y_n, b + n - y_n)}{B(a, b)} \]
Where \(B(a, b)\) is the Beta function.
Key properties:
Mean:
\[ E(Y_n) = n \cdot \frac{a}{a + b} \]
Variance:
\[ V(Y_n) = n \cdot \frac{ab}{(a + b)^2} \cdot \frac{a + b + n}{a + b + 1} \]
Laplace’s Law of Succession is a special case: if all \(m\) past outcomes were successes and the prior is uniform (Beta[1, 1]), the posterior becomes Beta[m+1, 1], and the predicted success probability for the next trial is:
\[ P(Y = 1) = \frac{m + 1}{m + 2} \]
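A minimal sketch of this rule as a conjugate Beta update:

```python
from fractions import Fraction

def laplace_rule(m):
    """Predicted success probability for the next trial after m consecutive
    successes, starting from a uniform Beta(1, 1) prior: the posterior is
    Beta(m + 1, 1), so the next-trial probability is its mean (m + 1)/(m + 2)."""
    a, b = 1 + m, 1          # conjugate Beta update with m successes, 0 failures
    return Fraction(a, a + b)
```

With \(m = 10\) past successes this gives 11/12; with no data at all it reduces to the uniform prior’s mean of 1/2.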
Example: Drug Trial Prediction (Binary)
Now suppose you want to predict the number of successes in 40 future patients. The prediction follows a beta-binomial distribution with:
If your decision rule is to continue development only if at least 25 out of 40 are successful, the posterior predictive probability of this happening is:
\[ P(Y \geq 25) \approx 0.329 \]
This probability can be computed by summing the right-hand tail of the beta-binomial.
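With SciPy, this tail sum can be computed directly from `scipy.stats.betabinom`. The posterior parameters `a` and `b` below are hypothetical placeholders, not the values behind the 0.329 figure above; substitute your own posterior:

```python
from scipy.stats import betabinom

# Tail probability P(Y >= 25) for a beta-binomial predictive over n = 40
# future patients. The Beta posterior parameters a, b are hypothetical.
n, a, b = 40, 15.0, 10.0
p_tail = betabinom.sf(24, n, a, b)       # P(Y >= 25) = 1 - P(Y <= 24)

# Equivalent direct sum over the right-hand tail of the pmf:
p_sum = sum(betabinom.pmf(y, n, a, b) for y in range(25, n + 1))
```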
Suppose:
You’re predicting future data \(Y_n\) which is assumed to follow:
\[ Y_n \sim N(\theta, \sigma^2/n) \]
The parameter \(\theta\) itself has a prior:
\[ \theta \sim N(\mu, \sigma^2/n_0) \]
Then the predictive distribution for \(Y_n\) is:
\[ Y_n \sim N\left(\mu, \sigma^2 \left(\frac{1}{n} + \frac{1}{n_0}\right)\right) \]
The mean of the predictive distribution is the prior mean \(\mu\), and the variance reflects both the sampling error of future data and the uncertainty in the parameter estimate.
If you already observed data \(y_m\), and now have a posterior:
\[ \theta \sim N\left(\frac{n_0 \mu + m y_m}{n_0 + m}, \frac{\sigma^2}{n_0 + m}\right) \]
Then the updated predictive distribution becomes:
\[ Y_n \sim N\left(\frac{n_0 \mu + m y_m}{n_0 + m}, \sigma^2\left(\frac{1}{n_0 + m} + \frac{1}{n}\right)\right) \]
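The update above can be sketched as a small function; the numbers used to exercise it below are hypothetical:

```python
import math

def normal_predictive(mu, n0, m, ybar_m, sigma, n):
    """Predictive distribution for the mean of n future observations after
    seeing m observations with mean ybar_m, given a N(mu, sigma^2/n0) prior.
    Returns (predictive mean, predictive standard deviation)."""
    post_mean = (n0 * mu + m * ybar_m) / (n0 + m)
    pred_var = sigma**2 * (1 / (n0 + m) + 1 / n)
    return post_mean, math.sqrt(pred_var)
```

Setting \(m = 0\) recovers the prior predictive \(N(\mu, \sigma^2(1/n_0 + 1/n))\).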
This means we’re more confident in our prediction when:
Example: The GREAT Study (Normal Prediction)
Goal: Predict log(OR) for 100 future patients in each trial arm.
Observed mortality ~10% ⇒ around 20 events
Prior + data ⇒ posterior for log(OR): \(N(-0.31, \sigma^2/267.2)\)
Predictive variance:
\[ \sigma^2 \left(\frac{1}{267.2} + \frac{1}{20}\right) = \sigma^2/18.6 \quad \Rightarrow \quad \text{SD} \approx 0.462 \]
Without prior (flat prior):
Posterior becomes likelihood-based only: \(N(-0.74, \sigma^2/30.5)\)
Predictive variance becomes:
\[ \sigma^2 \left(\frac{1}{30.5} + \frac{1}{20}\right) = \sigma^2/12.1 \quad \Rightarrow \quad \text{SD} \approx 0.582 \]
This shows:
Probability of achieving OR < 0.5 in the future:
This demonstrates how prior beliefs can significantly influence posterior predictive probabilities, especially in small or uncertain samples.
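Assuming \(\sigma = 2\) on the log(OR) scale (an assumption here; it reproduces the quoted SDs to within rounding), the two predictive SDs can be checked numerically:

```python
import math

# Predictive SDs for the GREAT example, assuming sigma = 2 (this value is an
# assumption; it matches the quoted SDs of about 0.462 and 0.582 to rounding).
sigma = 2.0

sd_with_prior = sigma * math.sqrt(1 / 267.2 + 1 / 20)   # informative prior
sd_flat_prior = sigma * math.sqrt(1 / 30.5 + 1 / 20)    # flat prior
```

The informative prior yields the tighter predictive distribution, as the text notes.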
These describe how the data are generated given certain parameters. They are foundational to both classical and Bayesian statistics, as they provide the likelihood function — the probability of observing the data given specific parameter values.
Standard distributions often used:
Non-standard distributions:
In Bayesian analysis, selecting an appropriate sampling distribution is critical because it influences how information from the observed data is incorporated.
Prior Distributions (Beliefs about Parameters) represent our beliefs about parameter values before observing the data. This is where Bayesian methods diverge significantly from classical methods.
The shape of the prior is crucial, as it reflects the plausibility of different parameter values. It needs to be:
Flexible to represent various forms of uncertainty
Capable of expressing features like:
Common prior distributions:
| Distribution | Support | PDF / PMF | Mean | Variance |
|---|---|---|---|---|
| Binomial | y ∈ {0, …, n} | P(Y=y) = C(n, y) θ^y (1−θ)^(n−y) | nθ | nθ(1−θ) |
| Bernoulli | y ∈ {0, 1} | P(Y=1)=θ, P(Y=0)=1−θ | θ | θ(1−θ) |
| Poisson | y ∈ {0, 1, 2, …} | P(Y=y) = λ^y e^(−λ) / y! | λ | λ |
| Beta | y ∈ (0, 1) | f(y) = [Γ(a+b)/(Γ(a)Γ(b))] y^(a−1) (1−y)^(b−1) | a / (a + b) | ab / [(a + b)^2 (a + b + 1)] |
| Uniform | y ∈ (a, b) | f(y) = 1 / (b − a) | (a + b)/2 | (b − a)^2 / 12 |
| Gamma | y > 0 | f(y) = (b^a / Γ(a)) y^(a−1) e^(−by) | a / b | a / b² |
| Root-Inverse-Gamma | y > 0 | f(y) = [2b^a / Γ(a)] y^(−2a−1) exp(−b / y²) | √b Γ(a−½)/Γ(a) (if a > ½) | b / (a − 1) − [E(Y)]² (if a > 1) |
| Half-Normal | y > 0 | f(y) = √(2 / (πσ²)) exp(−y² / (2σ²)) | σ√(2/π) | σ² (1 − 2/π) |
| Log-Normal | y > 0 | f(y) = (1 / (y√(2πσ²))) exp(−(log y − μ)² / (2σ²)) | e^(μ + σ²/2) | e^(2μ + σ²)(e^σ² − 1) |
| Student’s t | y ∈ ℝ | f(y) ∝ [1 + (y−μ)² / (νσ²)]^(−(ν+1)/2) | μ (if ν > 1) | νσ² / (ν − 2) (if ν > 2) |
| Bivariate Normal | (x, y) ∈ ℝ² | f(x,y) = 1 / (2πσ_Xσ_Y√(1−ρ²)) exp(−Q / [2(1−ρ²)]) | (μ_X, μ_Y) | Var(Y \| X) = σ_Y²(1−ρ²) |
A binomial distribution models the number of successes in \(n\) independent trials, each with success probability \(\theta\). If \(Y \sim \text{Binomial}(n, \theta)\), then:
Probability mass function (pmf):
\[ P(Y = y) = \binom{n}{y} \theta^y (1 - \theta)^{n - y}, \quad y = 0, 1, \dots, n \]
Mean:
\[ E(Y) = n\theta \]
Variance:
\[ \text{Var}(Y) = n\theta(1 - \theta) \]
If \(n = 1\), the binomial becomes a Bernoulli distribution.
This section delves into the application of Bayesian statistics for Poisson-distributed data using a Gamma distribution as a conjugate prior. This scenario is commonly encountered in settings where the data consist of counts or events following a Poisson process, and the prior information about the event rate (λ, lambda) is modeled using a Gamma distribution.
Context and Setup
Mathematical Formulation
Statistical Inference
Models the number of events in a fixed interval when events happen independently and at a constant rate \(\lambda\). If \(Y \sim \text{Poisson}(\lambda)\), then:
Probability mass function:
\[ P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots \]
Mean:
\[ E(Y) = \lambda \]
Variance:
\[ \text{Var}(Y) = \lambda \]
Example Illustration
Suppose we have a set of counts over ten time periods, and we assume that these counts are Poisson-distributed with an unknown rate λ. Using a Gamma prior for λ with parameters α = 1 and β = 1, we incorporate the observed data to update these parameters.
After observing the data, the posterior parameters become:
If the observed counts sum to 77 over 10 periods, the posterior distribution for λ becomes:
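Under the standard conjugate update, the posterior is Gamma(α + Σy, β + n) = Gamma(78, 11). A sketch of this update with SciPy:

```python
from scipy.stats import gamma

# Conjugate Gamma-Poisson update: Gamma(1, 1) prior, counts summing to 77
# over n = 10 periods, giving a Gamma(78, 11) posterior for lambda.
alpha0, beta0 = 1.0, 1.0
total, n = 77, 10
alpha_post, beta_post = alpha0 + total, beta0 + n

post = gamma(a=alpha_post, scale=1 / beta_post)   # scipy uses scale = 1/rate
post_mean = post.mean()                           # 78 / 11 ≈ 7.09
```

Note that `scipy.stats.gamma` is parameterized by shape and scale, so the rate β enters as `scale = 1/β`.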
Defined on the interval \((0, 1)\), flexible for modeling probabilities. If \(Y \sim \text{Beta}(a, b)\), then:
Probability density function (pdf):
\[ f(y) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} y^{a - 1} (1 - y)^{b - 1}, \quad 0 < y < 1 \]
Mean:
\[ E(Y) = \frac{a}{a + b} \]
Variance:
\[ \text{Var}(Y) = \frac{ab}{(a + b)^2(a + b + 1)} \]
Models equal probability over an interval \((a, b)\). If \(Y \sim \text{Uniform}(a, b)\), then:
Probability density function:
\[ f(y) = \frac{1}{b - a}, \quad a < y < b \]
Mean:
\[ E(Y) = \frac{a + b}{2} \]
Variance:
\[ \text{Var}(Y) = \frac{(b - a)^2}{12} \]
Defined on \((0, \infty)\), often used for modeling rates or waiting times. If \(Y \sim \text{Gamma}(a, b)\) with shape \(a\) and rate \(b\), then:
Probability density function:
\[ f(y) = \frac{b^a}{\Gamma(a)} y^{a - 1} e^{-b y}, \quad y > 0 \]
Mean:
\[ E(Y) = \frac{a}{b} \]
Variance:
\[ \text{Var}(Y) = \frac{a}{b^2} \]
If \(X \sim \text{Gamma}(a, b)\), then \(Y = \frac{1}{\sqrt{X}} \sim \text{RIG}(a, b)\). Then:
Probability density function:
\[ f(y) = \frac{2b^a}{\Gamma(a)} \cdot \frac{1}{y^{2a + 1}} \exp\left( -\frac{b}{y^2} \right), \quad y > 0 \]
Mean (if \(a > \tfrac{1}{2}\)):
\[ E(Y) = \sqrt{b} \cdot \frac{\Gamma(a - \tfrac{1}{2})}{\Gamma(a)} \]
Variance (if \(a > 1\)):
\[ \text{Var}(Y) = \frac{b}{a - 1} - \left( E(Y) \right)^2 \]
If \(X \sim \mathcal{N}(0, \sigma^2)\), then \(Y = |X| \sim \text{HalfNormal}(\sigma^2)\). Then:
Probability density function:
\[ f(y) = \sqrt{\frac{2}{\pi \sigma^2}} e^{-y^2 / (2\sigma^2)}, \quad y > 0 \]
Mean:
\[ E(Y) = \sigma \sqrt{\frac{2}{\pi}} \]
Variance:
\[ \text{Var}(Y) = \sigma^2 \left(1 - \frac{2}{\pi}\right) \]
If \(\log(Y) \sim \mathcal{N}(\mu, \sigma^2)\), then \(Y \sim \text{LogNormal}(\mu, \sigma^2)\). Then:
Probability density function:
\[ f(y) = \frac{1}{y \sqrt{2\pi \sigma^2}} \exp\left( -\frac{(\log y - \mu)^2}{2\sigma^2} \right), \quad y > 0 \]
Mean:
\[ E(Y) = e^{\mu + \sigma^2 / 2} \]
Variance:
\[ \text{Var}(Y) = e^{2\mu + \sigma^2} (e^{\sigma^2} - 1) \]
Defined by location \(\mu\), scale \(\sigma^2\), and degrees of freedom \(\nu\). If \(Y \sim t(\mu, \sigma^2, \nu)\), then:
Probability density function:
\[ f(y) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\Gamma\left( \frac{\nu}{2} \right) \sqrt{\nu \pi \sigma^2}} \left[ 1 + \frac{(y - \mu)^2}{\nu \sigma^2} \right]^{-\frac{\nu + 1}{2}} \]
Mean (if \(\nu > 1\)):
\[ E(Y) = \mu \]
Variance (if \(\nu > 2\)):
\[ \text{Var}(Y) = \frac{\nu \sigma^2}{\nu - 2} \]
Models two jointly normal variables \(X\) and \(Y\) with correlation \(\rho\). If \((X, Y) \sim \text{BN}(\mu_X, \mu_Y, \sigma_X, \sigma_Y, \rho)\), then the joint density is:
Joint probability density function:
\[ f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} Q \right) \]
where
\[ Q = \left( \frac{x - \mu_X}{\sigma_X} \right)^2 - 2\rho \left( \frac{x - \mu_X}{\sigma_X} \right) \left( \frac{y - \mu_Y}{\sigma_Y} \right) + \left( \frac{y - \mu_Y}{\sigma_Y} \right)^2 \]
Conditional expectation:
\[ E(Y \mid X = x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) \]
Conditional variance:
\[ \text{Var}(Y \mid X) = \sigma_Y^2 (1 - \rho^2) \]
This distribution is used for modeling two correlated outcomes or as a prior for two dependent parameters.
This section delves into Bayesian inference for normally distributed data, employing conjugate priors.
Determining the appropriate prior distribution in Bayesian statistics involves considering several options, each suited to different circumstances and amounts of prior knowledge.
1. Non-informative Priors
Non-informative priors, also known as flat or objective priors, are used when there is no specific prior knowledge about the parameters. These priors are designed to exert minimal influence on the posterior distribution, allowing the data to speak for themselves.
2. Conjugate Priors

Conjugate priors are chosen because they simplify the computation of the posterior distribution. A prior is conjugate to the likelihood function if the posterior distribution belongs to the same family as the prior distribution.
3. Empirical Bayes Methods
Empirical Bayes methods use the data to estimate the parameters of the prior distribution. This approach sits between fully Bayesian and frequentist methods, leveraging the strengths of both.
4. Expert Elicitation Priors
Priors derived from expert elicitation involve consulting subject-matter experts to quantify their beliefs about parameters before observing the current data. This method is particularly useful in fields where prior experimental or empirical data are sparse, but expert domain knowledge is rich.
When engaging in Bayesian analysis without specific prior knowledge about the parameters in question, employing non-informative priors can be an effective strategy. These types of priors are designed to minimally influence the posterior outcomes, allowing the data itself to primarily drive the inferences. Here is a detailed description of two significant types of non-informative priors: Jeffreys Prior and Reference Prior.
Background: Jeffreys Prior is named after Sir Harold Jeffreys, who introduced it in his work on Bayesian statistics. This prior is particularly noted for its property of invariance, meaning that it remains unchanged under transformation of parameters.
Key Features:
Formula: \[ p(\theta) \propto \sqrt{|I(\theta)|} \] where \(I(\theta)\) is the Fisher information matrix for parameter \(\theta\).
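For a single Bernoulli observation, the Fisher information is \(I(\theta) = 1/[\theta(1-\theta)]\), so the Jeffreys prior is proportional to \(\theta^{-1/2}(1-\theta)^{-1/2}\), i.e. a Beta(1/2, 1/2) distribution. A sketch verifying this from the definition of Fisher information:

```python
import math

def bernoulli_fisher_info(theta):
    """Fisher information for one Bernoulli trial, from the definition
    I(theta) = E[(d/d theta log p(y | theta))^2]."""
    score1 = 1 / theta            # score at y = 1
    score0 = -1 / (1 - theta)     # score at y = 0
    return theta * score1**2 + (1 - theta) * score0**2

# Jeffreys prior kernel: sqrt(I(theta)) = theta^(-1/2) (1 - theta)^(-1/2),
# the Beta(1/2, 1/2) density up to its normalizing constant 1/pi.
theta = 0.3
jeffreys_kernel = math.sqrt(bernoulli_fisher_info(theta))
```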
Applications and Limitations:
Background:
The concept of Reference Priors was developed to address some of the limitations encountered with Jeffreys Prior, especially in multivariate contexts where interactions between parameters can complicate the definition and application of non-informative priors.
Key Features:
Formula:
While the specific form of a reference prior can depend on the model and the parameters, its derivation generally involves complex calculations aimed at maximizing the Kullback-Leibler divergence, which may not have a closed-form expression and often requires numerical methods to solve.
Applications and Limitations:
Conjugate priors are a type of prior that, when used with a particular likelihood function, results in a posterior distribution that is of the same family as the prior distribution. This property of conjugacy is particularly valuable because it simplifies the mathematical operations involved in Bayesian updates.
Key Characteristics of Conjugate Priors:
Mathematical Simplicity: Conjugate priors simplify the Bayesian updating process because the posterior distributions are analytically tractable. This means that one can derive explicit formulas for updating the parameters of the distribution, typically referred to as hyperparameters, based on the observed data.
Parameterization: In the context of conjugate priors, the prior distribution is often characterized by a few key parameters, known as hyperparameters. For example, in the case of a normal distribution used as a likelihood, the conjugate prior would also be normal, characterized by parameters like the mean and variance. These hyperparameters are updated to form the posterior based on the data.
For practical Bayesian analysis, conjugate priors are particularly prevalent in cases where the likelihood functions belong to the exponential family, such as:
Common Examples of Conjugate Priors
Here are some common pairs of likelihood functions and their corresponding conjugate priors:
| Likelihood Distribution | Conjugate Prior | Posterior Distribution |
|---|---|---|
| Normal (variance known) | Normal | Normal |
| Normal (mean known) | Inverse-Gamma (on the variance) | Inverse-Gamma |
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Exponential | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |
For each of these cases, the conjugate prior simplifies the calculation of the posterior distribution because the posterior remains in the same distributional family as the prior.
Conjugate priors for the most common statistical families
Understanding Hyperparameters
In a conjugate prior, the parameters of the prior distribution are known as hyperparameters. These hyperparameters are used to define the shape and characteristics of the prior distribution itself. For example:
If we assign a prior distribution to the hyperparameters themselves, they are called hyperpriors, which leads to a hierarchical Bayesian model.
Advantages of Using Conjugate Priors:
Computational Efficiency: The algebraic convenience of conjugate priors allows for straightforward updates of beliefs in light of new data. This efficiency is particularly advantageous in iterative processes or when dealing with large datasets.
Theoretical Elegance: The ability to maintain the same family of distributions before and after observing data offers a closed-form solution for the posterior, which can be elegantly described and understood.
Ease of Interpretation: Because the form of the distribution remains constant, the interpretation of the parameters (such as mean or variance in a normal distribution) remains intuitive and consistent throughout the Bayesian analysis.
Drawbacks of Using Conjugate Priors:
Restrictive Assumptions: The main limitation of conjugate priors is that they might force one to adopt specific distributional forms that may not be substantively justified. The need to maintain conjugacy can impose restrictions on the choice of the prior that might not align with the actual prior knowledge about the parameters.
Limited Flexibility: While the use of conjugate priors simplifies calculations, it also reduces flexibility in modeling. The specific structure required for conjugacy might not adequately capture the complexities or nuances of the prior beliefs or the data.
Hyperparameter Sensitivity: The process requires the selection of hyperparameters, which can significantly influence the posterior. Incorrect or suboptimal choices of these parameters can lead to biased or misleading results, especially if the prior information is insufficient or vague.
Concept and Foundation of Empirical Bayes
Empirical Bayes methods are based on the idea of using observed data to estimate the parameters of the prior distribution in a Bayesian setup. Unlike traditional Bayesian methods, which require the specification of a prior based purely on subjective belief or external information, Empirical Bayes uses an evidence-based approach to determine the prior. This hybrid method falls between the fully Bayesian approach (which relies entirely on subjective priors) and the purely frequentist approach (which does not incorporate prior information at all).
How Empirical Bayes Methods Work

The process involves two main steps:
Estimation of Prior Parameters: First, the method uses the aggregate data to estimate the parameters of the prior distribution. This is typically done using maximum likelihood estimation or another suitable frequentist method. The goal here is to capture the common characteristics of the parameter across different observations or experiments.
Bayesian Updating: Once the prior parameters are estimated, each specific instance or data point is analyzed using Bayesian methods, where the empirically estimated prior is updated with the actual observed data to produce a posterior distribution.
Key Features of Empirical Bayes
Data-Driven Priors: The priors are not fixed before seeing the data; instead, they are determined based on the data itself. This is particularly useful in scenarios where little is known about the system beforehand, or when subjective priors are hard to justify.
Reduction in Variance: By borrowing strength from the entire dataset to form the prior, Empirical Bayes methods can reduce the variance of the estimates compared to purely frequentist approaches that treat each problem separately.
Computational Efficiency: Empirical Bayes can be more computationally efficient than fully Bayesian methods since it avoids the need for complex prior specification and the intensive computations that can entail, especially with large datasets.
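A minimal sketch of these two steps for a normal–normal model with known within-group variance (the data below are hypothetical):

```python
import numpy as np

# Empirical Bayes for K group estimates with known sampling variance s2:
#   Step 1: estimate the prior N(mu, tau2) from the observed estimates.
#   Step 2: shrink each estimate toward mu with the usual Bayesian weights.
y = np.array([2.1, -0.3, 1.4, 0.8, 3.0, -1.1])   # one estimate per group
s2 = 1.0                                          # known within-group variance

# Step 1: method-of-moments prior estimates from the aggregate data.
mu_hat = y.mean()
tau2_hat = max(y.var(ddof=1) - s2, 0.0)           # excess variance over noise

# Step 2: posterior (shrunken) mean for each group.
w = tau2_hat / (tau2_hat + s2)                    # weight on the raw estimate
shrunk = mu_hat + w * (y - mu_hat)
```

Every shrunken estimate lies between its raw value and the pooled mean, illustrating how the empirically estimated prior “borrows strength” across groups.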
Expert elicitation priors are a method in Bayesian statistics where subjective judgments from experts are formally incorporated into the Bayesian framework as prior distributions. This approach is particularly useful in scenarios where empirical data is sparse, but expert knowledge is abundant and reliable.
Expert elicitation involves systematically gathering opinions from one or more experts about uncertain quantities and then using these opinions to form prior distributions in Bayesian analysis. The experts provide their insights based on experience, existing research, and intuition, which are then quantified into a statistical format that can be directly used in the probabilistic models.
Process of Developing Expert Elicitation Priors
Selection of Experts: Careful selection of experts is crucial. Experts should have deep and relevant knowledge about the subject matter. Diversity in expertise can help capture a broad range of perspectives and reduce individual bias.
Elicitation Technique: Various techniques can be used to elicit quantitative data from experts, such as interviews, structured questionnaires, or interactive workshops. Techniques like the Delphi method, which involves multiple rounds of questioning with feedback, are commonly used to converge expert opinions towards a consensus.
Quantification of Expert Opinions: The elicited qualitative assessments are converted into quantitative measures. Experts might be asked to estimate parameters directly, provide percentiles for distributions, or express their confidence in different outcomes.
Aggregation of Responses: When multiple experts are involved, their responses need to be aggregated. This can be done through mathematical pooling of individual probability distributions or by using more sophisticated models that weigh expert opinions by their reliability or coherence with empirical data.
Advantages of Using Expert Elicitation Priors
Fills Data Gaps: In cases where empirical data is not available or is incomplete, expert opinions can provide valuable insights that would otherwise be unattainable.
Improves Model Relevance: By incorporating real-world knowledge, the models become more reflective of the actual phenomena being studied, enhancing the relevance and applicability of the statistical analyses.
Facilitates Complex Decision Making: Expert elicitation is particularly beneficial in complex decision-making scenarios, such as policy formulation or risk assessment, where the stakes are high and the problems are too intricate to be captured fully by available data.
Sensitivity analysis in Bayesian statistics is a crucial step to evaluate how the conclusions of a model are affected by the assumptions made, particularly about the prior distributions. Unlike frequentist methods, Bayesian approaches inherently depend on prior assumptions, which makes it vital to test how robust the results are to variations in those priors.
A Framework for Bayesian Sensitivity Analysis
This process, also called the “robust Bayesian” approach (and sometimes prior partitioning), examines how the posterior conclusions vary as the prior ranges over a community of plausible prior distributions.
A Bayesian sensitivity analysis aims to answer:
How robust are our conclusions to different reasonable beliefs about the unknown parameters?
This allows more transparent, context-aware, and decision-relevant interpretation of results. For instance, even if the posterior probability of treatment superiority is high, we may still find that under many reasonable priors, the probability of a clinically significant effect is modest—shifting the interpretation.
1. Purpose of Sensitivity Analysis in Bayesian Approaches
2. Steps in a Robust Bayesian Sensitivity Analysis
3. Types of Prior Communities (Increasing in Complexity)
There are three main approaches to defining a community of priors:
(a) Discrete Set of Priors:
(b) Parametric Family of Priors:
(c) Non-Parametric Family of Priors:
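Approach (a) can be sketched directly. The example below assumes a conjugate Normal model with a known standard error and purely illustrative numbers: the same likelihood is combined with a sceptical, a neutral, and an enthusiastic prior, and the posterior probability of benefit is compared across the community.

```python
import numpy as np
from scipy.stats import norm

def posterior(prior_mean, prior_sd, y, se):
    """Conjugate Normal update: prior N(m0, s0^2), likelihood y ~ N(theta, se^2)."""
    w0, wl = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (w0 + wl)
    post_mean = post_var * (w0 * prior_mean + wl * y)
    return post_mean, np.sqrt(post_var)

# Observed log(odds ratio) and its standard error (illustrative numbers)
y, se = -0.5, 0.25

# A discrete 'community of priors': (mean, sd) on the log(odds ratio) scale
community = {"sceptical": (0.0, 0.2),
             "neutral": (0.0, 1.0),
             "enthusiastic": (-0.5, 0.3)}
for name, (m0, s0) in community.items():
    m, s = posterior(m0, s0, y, se)
    # Posterior probability that the treatment is beneficial (theta < 0)
    print(f"{name:>12}: P(theta < 0 | y) = {norm.cdf(0, m, s):.3f}")
```

If the probability of benefit stays high across the whole community, the conclusion is robust; if the sceptical prior overturns it, the evidence is fragile.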
Overview of Hierarchical Models and τ
In a Bayesian hierarchical model, you assume that unit-level parameters (e.g., treatment effects in different studies), denoted as \(\theta_k\), are drawn from a shared population distribution:
\[ \theta_k \sim \mathcal{N}(\mu, \tau^2) \]
This assumes the \(\theta_k\) are exchangeable, meaning you believe there’s no systematic reason to expect one unit’s effect to be predictably different from another, unless explained by covariates.
The key hyperparameter in this setup is the between-unit standard deviation \(\tau\), which governs the degree of heterogeneity:
Bayesian inference requires a prior distribution for \(\tau\). Choosing this prior carefully is essential because:
Three Critical Assumptions in Hierarchical Priors
Markov Chain Monte Carlo (MCMC) is a class of algorithms that allows for the sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. This approach is particularly useful for obtaining a sequence of random samples from a multivariate probability distribution where direct sampling is difficult.
Monte Carlo Simulation:
Markov Chains:
MCMC Process:
Key Methods within MCMC:
Advantages of MCMC:
Challenges with MCMC:
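The core MCMC idea can be sketched with a random-walk Metropolis sampler, a minimal illustration targeting a standard Normal; the step size, iteration count, and seed are arbitrary choices.

```python
import numpy as np

def log_post(theta):
    """Unnormalised log-posterior: a standard Normal target for illustration."""
    return -0.5 * theta**2

def metropolis(log_post, n_iter=50_000, step=1.0, seed=0):
    """Random-walk Metropolis: propose theta' ~ N(theta, step^2) and accept
    with probability min(1, p(theta') / p(theta))."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    samples = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop              # accept the proposal
        samples[i] = theta            # otherwise keep the current state
    return samples

samples = metropolis(log_post)
burned = samples[5_000:]              # discard burn-in
print(burned.mean(), burned.std())    # should be near 0 and 1
```

Only the unnormalised posterior is needed, because the normalising constant cancels in the acceptance ratio; this is exactly what makes MCMC attractive when \(p(D)\) is intractable.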
The Expectation-Maximization (EM) algorithm is a robust technique used for parameter estimation in cases where data is incomplete, missing, or has latent variables. It consists of two main steps repeated iteratively: the Expectation step (E-step) and the Maximization step (M-step).
E-step: In this step, the algorithm calculates the expected value of the log-likelihood function, with respect to the conditional distribution of the latent variables given the observed data and the current estimates of the parameters. This step involves filling in missing data, estimating latent variables, or more generally, calculating the expected sufficient statistics that are necessary for the parameter updates in the next step.
M-step: Here, the algorithm finds the parameter values that maximize the expected log-likelihood found in the E-step. These parameters are updated to new values that are used in the next E-step.
The process repeats with these new parameters until the convergence criteria are met, which typically involves the change in the log-likelihood or in the parameter estimates falling below a threshold.
This iterative process helps in making the EM algorithm particularly useful for situations where the model depends on unobserved latent data. The EM algorithm is widely used in various applications like clustering in machine learning (e.g., Gaussian mixture models), bioinformatics (e.g., gene expression analysis), and more.
Key Benefits:
Limitations:
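The two steps can be sketched for a two-component Gaussian mixture on synthetic data, a minimal illustration rather than a production clustering routine; the initial values and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: two Gaussian components with latent (unobserved) labels
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses: mixing weight of component 1, component means and sds
pi_, mu = 0.5, np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: posterior responsibility of component 1 for each point
    d0 = (1 - pi_) * normal_pdf(x, mu[0], sigma[0])
    d1 = pi_ * normal_pdf(x, mu[1], sigma[1])
    r = d1 / (d0 + d1)
    # M-step: update parameters to maximise the expected log-likelihood
    pi_ = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])
    sigma = np.array([np.sqrt(np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r)),
                      np.sqrt(np.sum(r * (x - mu[1]) ** 2) / np.sum(r))])

print(pi_, mu)   # mixing weight near 0.7, means near -2 and 3
```

Each iteration provably does not decrease the observed-data log-likelihood, which is why EM converges reliably, though possibly to a local maximum.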
Variational inference (VI) is a method used in Bayesian statistics that approximates probability densities through optimization rather than sampling, as seen in Markov Chain Monte Carlo (MCMC) methods. It is particularly useful when dealing with complex models and large datasets, providing a faster computational alternative.
Process of Variational Inference:
Advantages of Variational Inference:
Limitations:
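The optimisation view of VI can be sketched in one dimension: pick a Gaussian approximating family \(q\) and minimise KL(q ∥ p) to a skewed target by numerical optimisation on a grid. The target (the posterior of a log-rate whose rate has a Gamma(3, 2) posterior) and all settings are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Unnormalised log-posterior for theta = log(rate) when the rate has a
# Gamma(shape=3, rate=2) posterior: a mildly skewed, non-Gaussian target.
def log_p(theta):
    return 3 * theta - 2 * np.exp(theta)

grid = np.linspace(-6, 4, 4001)
dx = grid[1] - grid[0]

def kl_qp(params):
    """KL(q || p) up to an additive constant, with q = Normal(m, s),
    approximated by a Riemann sum on the grid."""
    m, log_s = params
    s = np.exp(log_s)                 # optimise log(s) to keep s > 0
    q = np.exp(-0.5 * ((grid - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return np.sum(q * (np.log(q + 1e-300) - log_p(grid))) * dx

res = minimize(kl_qp, x0=[0.0, 0.0], method="Nelder-Mead")
m_hat, s_hat = res.x[0], np.exp(res.x[1])
print(m_hat, s_hat)                   # around 0.24 and 0.58 for this target
```

Practical VI replaces this grid sum with analytic or stochastic estimates of the ELBO, but the principle is the same: inference becomes an optimisation over the parameters of \(q\).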
Historical data has always played a role in experimental planning and meta-analysis. However, Bayesian methods provide a principled and flexible framework to incorporate historical information not just in design, but also in estimation, inference, prediction, and decision-making.
In Bayesian reasoning, historical data can help shape the prior distribution, inform assumptions, or contribute directly to combined evidence. But not all historical data are equally relevant or reliable — so it’s crucial to judge how and to what extent past data should influence current analysis.
The framework described here outlines six levels of relevance, ranging from completely discarding historical data to treating it as fully equivalent to current data.
(Figure: different assumptions relating the parameters underlying historical data to the parameter of current interest. Single arrows represent a distribution, double arrows represent logical functions, and wavy arrows represent discounting.)
Historical data are judged to have no bearing on the current study.
This might be because of:
In Bayesian terms, this means the prior is formed without using the historical data.
This assumes that past studies provide no useful information about the current study’s parameter (θ). Therefore, the prior is constructed without reference to historical data—possibly using a vague, flat, or reference prior.
The past and current studies are assumed to be similar in structure and quality, though not identical.
This allows for exchangeability of parameters, meaning:
This is a standard approach in meta-analysis, and Bayesian hierarchical models (or multilevel models) are well-suited to this setup.
Here, we assume the current parameter (θ) and historical parameters (θ₁, …, θ_H) are exchangeable, meaning they are thought to come from the same population distribution:
\[ \theta_1, \theta_2, \ldots, \theta_H, \theta \sim \mathcal{N}(\mu, \tau^2) \]
If the variances \(\sigma_h^2\) and \(\tau^2\) are known, the posterior for \(\theta\) is:
\[ \theta \mid y_1, \ldots, y_H \sim \mathcal{N}\left( \frac{\sum_h w_h y_h}{\sum_h w_h},\ \frac{1}{\sum_h w_h} + \tau^2 \right) \]
where:
\[ w_h = \frac{1}{\sigma_h^2 + \tau^2} \]
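The formula above is simple to compute directly; the historical estimates and between-study standard deviation below are purely illustrative.

```python
import numpy as np

def prior_for_new_theta(y, se, tau):
    """Prior for the current theta implied by H exchangeable historical
    estimates y_h (standard errors se_h) and between-study sd tau,
    using weights w_h = 1 / (se_h^2 + tau^2)."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = 1.0 / (se**2 + tau**2)
    mean = np.sum(w * y) / np.sum(w)
    # uncertainty about mu, plus between-study variation tau^2
    var = 1.0 / np.sum(w) + tau**2
    return mean, var

# Three hypothetical historical log-odds-ratios with standard errors
y_h = [-0.3, -0.6, -0.1]
se_h = [0.2, 0.3, 0.25]
print(prior_for_new_theta(y_h, se_h, tau=0.2))
```

Note that as \(\tau \to 0\) this collapses to simple precision-weighted pooling, while larger \(\tau\) both reweights the studies and inflates the variance of the implied prior.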
Historical data may contain internal (methodological flaws) or external (population/context mismatch) biases.
Rather than discarding them, Bayesian models can:
This is especially useful in observational studies or when reusing non-randomised data.
The adjustments can be formalised using bias distributions or discount functions.
Historical parameters are modeled as biased versions of the current parameter. For each \(h\):
\[ \theta_h = \theta + \delta_h \]
where \(\delta_h\) is a bias term that can be:
This results in a prior:
\[ \theta \mid y_h \sim \mathcal{N}(y_h,\ \sigma_h^2 + \tau_h^2) \]
The past data are considered unbiased, but less reliable than current data.
So they are included with reduced precision — in effect, you “shrink” their influence.
This approach acknowledges that historical data may be:
In practice, this can be implemented by:
Here we assume:
\[ \theta_h = \theta \]
so the parameters are identical across studies, but we discount the strength of the past evidence by a factor \(\alpha\) (\(0 < \alpha \le 1\)); this construction is known as a power prior.
The effect is to reduce the effective sample size (or weight) of the historical data:
\[ \theta \mid y_h \sim \mathcal{N}\left(y_h,\ \frac{\sigma_h^2}{\alpha}\right) \]
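The discounting can be sketched numerically under a Normal likelihood; the historical and current estimates below are illustrative. As \(\alpha\) decreases, the historical prior weakens and the posterior moves toward the current data.

```python
import numpy as np

def discounted_prior(y_h, se_h, alpha):
    """Prior from one historical estimate y_h, discounted by power alpha:
    the variance is inflated from se_h^2 to se_h^2 / alpha (alpha=1: full weight)."""
    return y_h, se_h**2 / alpha

def posterior(prior_mean, prior_var, y, se):
    """Conjugate Normal update of the discounted prior with current data."""
    w0, wl = 1 / prior_var, 1 / se**2
    var = 1 / (w0 + wl)
    return var * (w0 * prior_mean + wl * y), var

y_hist, se_hist = -0.4, 0.15   # historical estimate (illustrative)
y_now, se_now = -0.1, 0.20     # current estimate (illustrative)

for alpha in (1.0, 0.5, 0.1):
    m0, v0 = discounted_prior(y_hist, se_hist, alpha)
    m, v = posterior(m0, v0, y_now, se_now)
    print(f"alpha={alpha}: posterior mean {m:.3f}, sd {np.sqrt(v):.3f}")
```

With \(\alpha = 1\) the historical study enters at full strength; with small \(\alpha\) it contributes only a fraction of its original information.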
The current parameter of interest is a deterministic function of parameters estimated in past studies.
For example:
This kind of structure can be expressed using Bayesian networks or modular models, linking different sources via functional relationships.
The current parameter is a known function of parameters from historical studies.
Example:
θ₁ = treatment effect in males
θ₂ = treatment effect in females
If a future study involves 60% males, then:
\[ \theta = 0.6\,\theta_1 + 0.4\,\theta_2 \]
Prior for θ is derived from known or estimated priors for θ₁ and θ₂.
This method is particularly useful when populations differ but in predictable ways.
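For the worked example above, the induced prior for θ follows from the standard rule for linear combinations of independent Normal variables; the subgroup prior means and standard deviations below are hypothetical.

```python
import numpy as np

# Hypothetical priors for the subgroup effects (independent Normals):
# theta1 = treatment effect in males, theta2 = treatment effect in females
m1, s1 = -0.5, 0.3
m2, s2 = -0.2, 0.4

# Future study population: 60% male, 40% female,
# so theta = 0.6 * theta1 + 0.4 * theta2.
w = np.array([0.6, 0.4])
mean = w @ np.array([m1, m2])
# For independent Normals, variances of a linear combination add with squared weights
sd = np.sqrt(w[0]**2 * s1**2 + w[1]**2 * s2**2)
print(mean, sd)   # induced prior mean and sd for theta
```

More complicated functional links (ratios, products, chains of parameters) generally have no closed form and are handled by simulation within the Bayesian network.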
Historical and current data are considered completely equivalent.
Parameters are assumed to be drawn from the same distribution, and data can be pooled directly.
This is the most optimistic assumption and works best when:
In effect, you assume exchangeability of individuals (not just studies), so the prior and current data are merged seamlessly.
This assumes all studies are measuring exactly the same parameter:
\[ \theta_h = \theta \quad \text{for all } h \]
\[ \theta \mid y_1, \ldots, y_H \sim \mathcal{N}\left( \frac{\sum_h y_h / \sigma_h^2}{\sum_h 1/\sigma_h^2},\ \frac{1}{\sum_h 1/\sigma_h^2} \right) \]
In real-world health care evaluation, statistical inference often faces multiplicity—the need to analyze multiple parameters:
To handle this, Bayesian hierarchical models offer a structured approach that accounts for:
Three conceptual assumptions help guide the modeling:
Identical parameters (complete pooling): All units are assumed to estimate the same underlying effect \(\theta\), and their data are fully pooled. For example:
\[ Y_k \sim N(\theta, \sigma_k^2) \]
Bayesian updating leads to a pooled posterior for \(\theta\), resembling a weighted average of individual estimates.
Independent parameters (no pooling): Each \(\theta_k\) is estimated independently.
\[ \theta_k \sim \text{Uniform}, \quad Y_k \sim N(\theta_k, \sigma_k^2) \]
This gives posterior:
\[ \theta_k \mid y_k \sim N(y_k, \sigma_k^2) \]
No information is shared between units.
Exchangeable parameters (partial pooling via hierarchical models): Each \(\theta_k\) is viewed as drawn from a population distribution:
\[ \theta_k \sim N(\mu, \tau^2), \quad Y_k \sim N(\theta_k, \sigma_k^2) \]
Here, \(\mu\) and \(\tau^2\) are hyperparameters—estimated from the data (empirical Bayes) or given priors (full Bayes). The posterior shrinks each unit’s estimate toward \(\mu\), with the shrinkage factor:
\[ B_k = \frac{\sigma_k^2}{\sigma_k^2 + \tau^2} \]
Larger \(\tau^2\): less pooling. Smaller \(\tau^2\): more shrinkage.
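The shrinkage factor can be illustrated numerically (all values hypothetical): a precise unit keeps most of its observed estimate, while a noisy unit is pulled strongly toward the population mean.

```python
import numpy as np

def shrink(y, sigma, mu, tau):
    """Posterior mean of theta_k under exchangeability: pull the observed
    estimate y toward the population mean mu by B = sigma^2 / (sigma^2 + tau^2)."""
    B = sigma**2 / (sigma**2 + tau**2)
    return B * mu + (1 - B) * y, B

# A precise unit and a noisy unit, both observing y = 0.8,
# with population mean 0 and between-unit sd tau = 0.3
mu, tau = 0.0, 0.3
for y, sigma in [(0.8, 0.1), (0.8, 0.6)]:
    post_mean, B = shrink(y, sigma, mu, tau)
    print(f"y={y}, sigma={sigma}: shrunk to {post_mean:.3f} (B={B:.2f})")
```

Here the precise unit (sigma = 0.1) is shrunk only 10% of the way to the mean, while the noisy unit (sigma = 0.6) is shrunk 80% of the way, which is exactly the partial-pooling behaviour described below.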
First, assuming all parameters are identical implies complete pooling. Under this model, each unit (e.g., trial or subgroup) is estimating the same underlying effect. All data are combined into one analysis, and individual differences are ignored. Statistically, this is equivalent to estimating a single parameter using weighted averages. The result is a high degree of certainty, but potentially oversimplified if real variation exists.
Second, assuming all parameters are independent implies no pooling. Each unit is analyzed separately, with no borrowing of information. This is the most conservative approach and gives wide uncertainty intervals. Statistically, this treats each observed effect as arising from its own prior, typically uniform, resulting in posterior distributions that are just normal likelihoods centered on the data.
Third, the most nuanced assumption is that of exchangeability. Here, the individual parameters (such as treatment effects) are assumed to be drawn from a common population distribution. This does not imply they are the same, but rather that we have no reason to believe any one is systematically different from the others. Exchangeability leads naturally to hierarchical or multi-level models, where the unit-specific effects are modeled as random effects with a common mean and variance.
Under the exchangeability assumption, each observed effect is shrunk toward the group mean, and the degree of shrinkage depends on the relative variances of the data and the prior. Smaller, more uncertain studies shrink more; large, precise studies shrink less. This is known as partial pooling and helps stabilize estimates, particularly when individual studies are small or noisy.
A practical example is the magnesium meta-analysis. Early small trials showed a large mortality benefit for magnesium in acute myocardial infarction. A fixed-effects meta-analysis found a significant odds ratio of 0.67. A hierarchical Bayesian model, assuming exchangeability of trial effects, moderated this to an odds ratio of 0.58, as extreme results from small trials were shrunk toward the overall mean. This demonstrates how Bayesian models can temper overoptimistic findings from underpowered studies.
To assess the credibility of such findings, sceptical priors centered on no effect (odds ratio = 1) are introduced. In the pooled analysis, a highly sceptical prior equivalent to 421 events is needed to make the posterior just include 1—an unrealistically extreme prior. In the random-effects model, a sceptical prior equivalent to only 58 events is sufficient to neutralize the evidence. This suggests the evidence is more fragile and less convincing under the hierarchical model.
Nuisance parameters are common in most real-world models. For example, if the main goal is to estimate a treatment effect (denoted by θ), the data may also depend on background factors, variances, or baseline event rates, which are not of direct interest. These are nuisance parameters.
Classical Methods for Eliminating Nuisance Parameters
Several techniques in traditional statistics aim to remove or reduce the influence of nuisance parameters from the likelihood function for θ:
Likelihood approximation: Use a summary statistic or transformation whose likelihood does not depend on nuisance parameters. For example, normal approximations for odds ratios or hazard ratios eliminate dependence on unknown variances.
Plug-in estimates: Estimate the nuisance parameters (usually by maximum likelihood) and substitute them into the likelihood for θ. This is computationally simple but can underestimate uncertainty, especially if the number of nuisance parameters is large. In Bayesian terms, this is known as the empirical Bayes approach.
Conditional likelihood: Form a conditional likelihood that depends only on θ, by conditioning on aspects of the data that are uninformative about the nuisance parameters. This is a popular method in likelihood-based inference.
Profile likelihood: Maximize the likelihood over the nuisance parameters for each value of θ, producing a “profile” likelihood function that depends only on θ and can be used for inference. This method is illustrated in the context of hierarchical modeling.
Each of these methods ultimately reduces the problem to one involving only θ, which can then be combined with a prior to conduct a Bayesian analysis.
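As a sketch of the profile-likelihood idea, the example below profiles out an unknown variance when estimating a Normal mean; the data are synthetic and the grid is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=50)   # data; the variance is a nuisance parameter

def profile_loglik(theta, x):
    """For each theta, plug in the maximising nuisance value
    sigma_hat(theta)^2 = mean((x - theta)^2), leaving a function of theta only."""
    n = len(x)
    s2_hat = np.mean((x - theta) ** 2)
    return -0.5 * n * np.log(2 * np.pi * s2_hat) - 0.5 * n

thetas = np.linspace(0, 4, 401)
pl = np.array([profile_loglik(t, x) for t in thetas])
theta_hat = thetas[np.argmax(pl)]
print(theta_hat, x.mean())   # the profile MLE equals the sample mean (to grid resolution)
```

The resulting one-dimensional function of θ can then be combined with a prior on θ alone, which is the compromise strategy mentioned at the end of this section.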
The Fully Bayesian Approach
The full Bayesian treatment does not eliminate nuisance parameters from the likelihood at all. Instead, priors are placed on all unknown parameters, a joint posterior is formed, and the nuisance parameters are integrated (marginalized) out to leave the posterior distribution of θ.
This method fully accounts for uncertainty in the nuisance parameters and is consistent with Bayesian principles. However, it can be computationally intensive, especially in high-dimensional models.
This approach is used in various situations, such as:
When hierarchical models are used, often the full Bayesian model is used only at the higher levels (e.g., for between-group variance τ), while using approximations at the lower (sampling) level.
Sensitivity analysis is emphasized as essential when placing priors on nuisance parameters. Even seemingly non-informative priors can exert significant influence on the posterior if not carefully chosen. In such cases, combining traditional techniques for handling nuisance parameters with Bayesian analysis of the primary parameter may offer a good compromise.
Here is a detailed explanation of the Bayesian Checklist for reporting and evaluating Bayesian analyses, especially in the context of health-care intervention studies. The checklist serves as a framework to ensure transparency, reproducibility, and credibility of Bayesian modeling and decision-making.
1. Background
The Intervention: Clearly describe the intervention being evaluated, the context (clinical or policy), and the target population. This should make clear the relevance and scope of the analysis.
Aim of Study: Specify what you are trying to infer (the parameters of interest, such as the treatment effect) and what decisions or actions (if any) may follow from the analysis. The former requires a prior distribution; the latter should be supported by a loss or utility function.
2. Methods
Study Design: Describe the design of the study or studies being analyzed. If data are pooled from multiple sources, comment on their similarity to justify any assumption of exchangeability (i.e. whether they can be treated as if drawn from the same distribution).
Outcome Measure: State the quantity of primary interest (e.g. odds ratio, risk difference, mean effect) that you are estimating or predicting.
Statistical Model: Lay out the relationship between the observed data and the underlying parameters. This could be a full probabilistic model (likelihood plus prior), or a description detailed enough for a competent analyst to reconstruct it mathematically.
Prospective Bayesian Analysis: Indicate whether the Bayesian components (e.g. priors, utility functions) were defined before data collection (prospective) or adapted after seeing the data (retrospective). Mention whether interim analyses were planned.
Prior Distribution: Clearly specify priors for all parameters of interest.
Loss Function or Decision Framework: If the study involves decisions (e.g. whether to proceed with treatment approval), specify the loss or utility function. This might involve:
Computation and Software: Describe the computational methods used, especially if Markov chain Monte Carlo (MCMC) or similar techniques were involved.
3. Results
Evidence from the Study: Report the raw data as clearly as confidentiality permits:
4. Interpretation
Bayesian Interpretation: Summarize the posterior distribution:
Make a clear distinction between:
Sensitivity Analysis: Report how different prior distributions or loss functions affected the results. If the conclusions change substantially, this needs to be acknowledged and discussed.
Comments: Provide a candid evaluation of:
This checklist functions as a template for best practices in Bayesian analysis, ensuring clarity and reproducibility. It complements more general reporting standards like CONSORT but is focused on Bayesian-specific elements like priors, posterior distributions, and decision-making under uncertainty.
It would be misleading to dichotomise statistical methods as either ‘classical’ or ‘Bayesian’, since both terms cover a bewildering range of techniques. A rough taxonomy can be developed by distinguishing two characteristics: whether or not prior distributions are used for inference, and whether the objective is estimation, hypothesis testing, or a decision requiring a loss function of some form.
Whether prior knowledge is used:
What the analysis aims to achieve:
Here is the summarized structure:
| Objective | Informal (No Prior) | Formal (Uses Prior) |
|---|---|---|
| Inference | Fisherian | Proper Bayesian |
| Hypothesis test | Neyman–Pearson | Bayes factors |
| Decision | Classical decision theory | Full decision-theoretic Bayesian |
Fisherian (Inference + No Prior)
Focuses on likelihood-based estimation, using maximum likelihood or likelihood intervals without formal priors.
The Fisherian method focuses on the likelihood function, which measures how well different parameter values are supported by the observed data. Key elements include:
Criticism: Fisher’s use of P-values has been misapplied over time—used as formal proof rather than as measures of evidence.
Proper Bayesian (Inference + Prior)
Estimates parameters by combining prior distributions with likelihoods to form posterior distributions.
Neyman–Pearson (Testing + No Prior)
Classical hypothesis testing framework: use of null/alternative hypotheses, p-values, Type I and II errors.
Neyman–Pearson theory is based on decision-making under repeated sampling and is more procedural:
Criticism: The approach doesn’t allow direct probability statements about the hypothesis or observed result in a specific trial. Its concepts apply to repeated experiments, not necessarily to the one at hand.
Bayes Factors (Testing + Prior)
Bayesian hypothesis testing using the ratio of marginal likelihoods under two models (see Bayes Factor).
Classical Decision Theory (Decision + No Prior)
Uses frequentist ideas to minimize expected loss, assuming no prior distribution on parameters.
Full Decision-Theoretic Bayesian (Decision + Prior)
Makes decisions by minimizing expected loss with respect to a posterior distribution. Explicitly combines beliefs and consequences.