
Method | A Crash Course For Bayesian Inference

Posted on: November 1, 2024

This is the final presentation for {SG8001: Teaching Students: First Steps}.


1. After This Crash Course, You Will Be Able To:

2. How to Infer? Frequentist vs. Bayesian Perspective

Statistical inference is the process of using a random sample to draw conclusions about unknown characteristics of an underlying population.

When you’ve used methods like Maximum Likelihood Estimation (MLE) to generate point estimates, build confidence intervals, or conduct hypothesis tests using p-values, you’ve been working within the frequentist framework.

However, this approach has some limitations. For instance:

Bayesian inference, on the other hand, addresses these challenges by offering a more flexible philosophical framework:

3. Three Components in a Bayesian Pipeline

Bayesian inference is essentially about using observed data to update prior knowledge about the population, resulting in what is called the posterior. You can then make further inferences about the population using this updated knowledge, which incorporates both the current data and your original beliefs. The three core components in this updating process are:

- The prior $p(\theta)$: what you believe about the unknown parameter $\theta$ before seeing any data.
- The likelihood $p(D|\theta)$: how probable the observed data $D$ are under each possible value of $\theta$.
- The posterior $p(\theta|D)$: your updated belief about $\theta$ after observing the data.

Using Bayes’ rule, we can link these three components together to understand how to update from prior to posterior:

$$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} = \frac{p(D|\theta)p(\theta)}{\sum_{\theta} p(D|\theta)p(\theta)}$$

The denominator is the normalization factor, often called the evidence, or the average probability of the data under the prior. Once we plug in the observed data and integrate (or sum) over θ, it becomes a constant. Hence, when performing tasks like MAP estimation, we can ignore the denominator and maximize only the numerator (although MAP is often not recommended, since a single point estimate discards much of the information carried by the posterior):

$$p(\theta|D) \propto p(D|\theta)p(\theta)$$
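To make this update concrete, here is a minimal grid-approximation sketch in Python/NumPy (my own illustration, not part of the original course material); the prior and likelihood below are placeholders that you can swap for your own model:

```python
import numpy as np

# Grid of candidate parameter values theta (assume a parameter living in (0, 1))
theta = np.linspace(0.001, 0.999, 999)

# Prior p(theta): uniform over the grid here, but any prior works
prior = np.full_like(theta, 1.0 / len(theta))

# Likelihood p(D | theta): placeholder data, e.g. 7 successes in 10 Bernoulli trials
likelihood = theta**7 * (1 - theta)**3

# Numerator of Bayes' rule
unnormalized = likelihood * prior

# Evidence p(D): the average probability of the data over the prior (the denominator)
evidence = unnormalized.sum()

# Posterior p(theta | D); note that the MAP estimate does not depend on the evidence
posterior = unnormalized / evidence
theta_map = theta[np.argmax(posterior)]
print(f"MAP estimate: {theta_map:.3f}")   # 0.700 for this placeholder data
```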

Next, we will dive into a toy example to illustrate this Bayesian pipeline.

4. Toy Example: Biased Coin

Consider a biased coin with an unknown probability θ of landing heads (H). You toss the coin five times and observe the following sequence: HTHHH (where T represents tails). Your task is to estimate θ based on this data. Let’s solve this problem using Bayesian inference.

First, since the sequence HTHHH contains four heads and one tail, we can write the likelihood function:

$$p(D|\theta) = \theta^4 (1-\theta)$$

Next, we need to choose a prior distribution before proceeding.

1) Uniform(0,1) as Prior

This prior indicates that we have no prior knowledge about the coin’s bias, meaning every value of θ between 0 and 1 is equally likely. This is known as an uninformative prior in this context. The probability density function is:

$$p(\theta) = 1$$

The posterior distribution is then proportional to the likelihood times the prior:

$$p(\theta|D) \propto p(D|\theta)p(\theta) = \theta^4 (1-\theta)$$

To find the Maximum A Posteriori (MAP) estimate, we maximize the posterior:

$$\argmax_{\theta}\ p(\theta|D) = \argmax_{\theta}\ \theta^4 (1-\theta) = 0.8$$
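For completeness, the value 0.8 comes from setting the derivative of the unnormalized posterior to zero:

$$\frac{d}{d\theta}\,\theta^4(1-\theta) = 4\theta^3 - 5\theta^4 = \theta^3(4 - 5\theta) = 0 \quad\Rightarrow\quad \theta = \frac{4}{5} = 0.8$$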

This result is identical to the Maximum Likelihood Estimate (MLE) in the frequentist approach.

2) Beta(2,2) as Prior

Now, let’s assume a Beta(2,2) prior, reflecting our prior belief that the probability of heads is centered around 0.5, but with some uncertainty. This prior is informative, encoding a modest amount of initial belief before observing any data. The prior distribution is:

$$p(\theta) = \text{Beta}(\theta;2,2)$$

Thus, the posterior is:

$$p(\theta|D) \propto \theta^4 (1-\theta) \times \text{Beta}(\theta;2,2) = \cdots = \text{Beta}(\theta;6,3)$$
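Filling in the omitted algebra: the Beta(2,2) density is proportional to $\theta(1-\theta)$, so

$$p(\theta|D) \propto \theta^4(1-\theta) \times \theta(1-\theta) = \theta^{5}(1-\theta)^{2} = \theta^{6-1}(1-\theta)^{3-1},$$

which is exactly the kernel of a Beta(6,3) distribution.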

Maximizing the posterior gives us the MAP estimate. Since the mode of a $\text{Beta}(a, b)$ distribution with $a, b > 1$ is $(a-1)/(a+b-2)$, we get:

$$\argmax_{\theta}\ p(\theta|D) = \frac{5}{7} \approx 0.714$$

Comparing this with the previous estimate, we can see that θ=0.714 is pulled slightly closer to 0.5, reflecting the influence of the prior belief that θ is likely around 0.5.

3) Beta(20,20) as Prior

In this case, we use a Beta(20,20) prior, which indicates even greater confidence that the coin is balanced before observing the sequence HTHHH. After incorporating the observed data, the posterior becomes Beta(24,21). Using MAP, we estimate θ as (24-1)/(24+21-2) = 23/43 ≈ 0.535.

This estimate is significantly lower than the estimate from the Beta(2,2) prior, bringing it closer to 0.5, which reflects the stronger influence of our more confident prior belief.
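If you want to reproduce the whole coin example at once, a short script like the following sketch works (the variable names and the representation of the uniform prior as Beta(1,1) are my choices, not the original post's); it relies on the conjugate update that a Beta(a, b) prior combined with k heads out of n tosses yields a Beta(a + k, b + n - k) posterior:

```python
# Conjugate Beta-Binomial updates for the three priors in the coin example.
k, n = 4, 5  # HTHHH: 4 heads out of 5 tosses

priors = {
    "Uniform(0,1) = Beta(1,1)": (1, 1),
    "Beta(2,2)": (2, 2),
    "Beta(20,20)": (20, 20),
}

for name, (a, b) in priors.items():
    # Beta(a, b) prior + k heads out of n tosses -> Beta(a + k, b + n - k) posterior
    a_post, b_post = a + k, b + n - k
    # MAP estimate = posterior mode = (a - 1) / (a + b - 2), valid since a_post, b_post > 1
    theta_map = (a_post - 1) / (a_post + b_post - 2)
    print(f"{name:>24}: posterior Beta({a_post},{b_post}), MAP = {theta_map:.3f}")
```

Running it prints the three MAP estimates discussed above: 0.800, 0.714, and 0.535.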

5. Take-Home Notes

6. Get Your Hands Dirty

Suppose you're trying to estimate the true value of a parameter $\mu$, which represents the mean of a normal population with a known variance $\sigma^2 = 1$. You have a sample of observed data consisting of independent measurements: $x = [2.1, 1.9, 2.2, 2.0, 1.8]$. Try different normal priors and perform MAP estimation for each. Then, compare the results with the MLE result.
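As a possible starting point (this sketch is mine, not part of the original exercise), note that a normal prior on $\mu$ is conjugate here, so the posterior of $\mu$ is again normal and its mode (the MAP estimate) has a closed form:

```python
import numpy as np

# Observed data: i.i.d. draws from N(mu, sigma^2) with known sigma^2 = 1
x = np.array([2.1, 1.9, 2.2, 2.0, 1.8])
sigma2 = 1.0
n, xbar = len(x), x.mean()

# MLE of mu is simply the sample mean
print(f"MLE: {xbar:.3f}")

# Try a few Normal(mu0, tau2) priors on mu (the values below are arbitrary choices)
for mu0, tau2 in [(0.0, 100.0), (0.0, 1.0), (5.0, 0.1)]:
    # Conjugate normal-normal update: the posterior mean (= mode = MAP) is a
    # precision-weighted average of the prior mean and the sample mean
    post_mean = (mu0 / tau2 + n * xbar / sigma2) / (1 / tau2 + n / sigma2)
    print(f"Prior N({mu0}, {tau2}): MAP = {post_mean:.3f}")
```

A diffuse prior (large variance) should give a MAP estimate close to the MLE, while a tight prior centered away from the data pulls the estimate toward the prior mean, just as in the coin example.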