STA260 Lecture 11
- Review
- Maximum Likelihood Estimation
- If $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(x; \theta)$, where $\theta$ is a parameter.
- We have the score function as $S(\theta) = \frac{\partial}{\partial \theta}\,\ell(\theta)$, where $\ell(\theta) = \log L(\theta)$ is the log-likelihood.
- This gives us information about the strength and direction of the evidence, using the magnitude and the sign of $S(\theta)$.
- For sufficiently small $\epsilon$, $\ell(\theta + \epsilon) \approx \ell(\theta) + \epsilon\, S(\theta)$.
- Small $\epsilon$ shows us the localized behaviour of $\ell$ around $\theta$.
- This is useful in higher dimensions, where we can't visualize $\ell(\theta)$ directly; the score function gives a sense of where the maximum is.
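- As a concrete illustration of using the score to locate the maximum (a Poisson model is assumed here for the sketch; it is not necessarily the lecture's example):

```latex
% Sketch: score for X_1, ..., X_n iid Poisson(\theta)
\ell(\theta) = \sum_{i=1}^n x_i \log\theta - n\theta - \sum_{i=1}^n \log(x_i!)
\qquad
S(\theta) = \ell'(\theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n
% S(\theta) > 0 for \theta < \bar{x} and S(\theta) < 0 for \theta > \bar{x},
% so solving S(\hat\theta) = 0 gives \hat\theta = \bar{x}.
```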
- Fisher Information
- Measures the information about $\theta$ carried by a single observation (a point) or by a sample.
- High Fisher information means $\ell(\theta)$ is steep around its maximum, which means precise parameter estimation from the data is possible.
- Low Fisher information means $\ell(\theta)$ is flat around its maximum, which means the parameter estimate will not be very precise.
- Under certain conditions (regularity conditions) on $f(x;\theta)$, $\sqrt{n I(\theta)}\,(\hat\theta_{\text{MLE}} - \theta)$ converges in distribution to a standard normal distribution as $n \to \infty$.
- This means that $\hat\theta_{\text{MLE}}$ is approximately normally distributed with mean $\theta$ and variance $\frac{1}{n I(\theta)}$ for large $n$.
- Under regularity conditions:
- $\log f(x;\theta)$ is twice differentiable with respect to $\theta$.
- The support of $f(x;\theta)$ does not depend on $\theta$, so differentiation and integration can be interchanged.
- $I(\theta) = -E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]$: Fisher information of a point.
- $I_n(\theta) = n\,I(\theta)$: Fisher information in a sample of size $n$.
- Multiple data points are required to get a precise estimate of $\theta$.
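- For reference (this identity is standard under the regularity conditions, and is what the "first derivative method" below relies on), the Fisher information of a point can be computed in two equivalent ways:

```latex
I(\theta)
= E\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{\!2}\right]
= -\,E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]
```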
- Now let's manipulate the score $S(\theta)$ to show the above for Fisher information.
- Under the regularity conditions $E[S(\theta)] = 0$, so the variance of the score is $\operatorname{Var}(S(\theta)) = E[S(\theta)^2] = n I(\theta) = I_n(\theta)$.
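- A sketch of the standard argument for one observation, writing $S_1(\theta)$ for its score and assuming we may differentiate under the integral sign:

```latex
% Differentiate \int f(x;\theta)\,dx = 1 with respect to \theta:
0 = \int \frac{\partial f}{\partial\theta}\,dx
  = \int \frac{\partial \log f}{\partial\theta}\,f\,dx
  = E[S_1(\theta)]
% Differentiate once more and rearrange:
0 = E\!\left[\frac{\partial^2 \log f}{\partial\theta^2}\right] + E\!\left[S_1(\theta)^2\right]
\;\Rightarrow\;
\operatorname{Var}(S_1(\theta)) = E[S_1(\theta)^2]
  = -E\!\left[\frac{\partial^2 \log f}{\partial\theta^2}\right] = I(\theta)
```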
- Example Using Theorem 10
- Find the distribution of the MLE of $\theta$.
- Let $X_1, \dots, X_n$ be an i.i.d. sample.
- Fisher Information:
- #tk practice doing this with the first-derivative method as well.
- From the previous section, we know the MLE of $\theta$ (found by setting the score function to zero).
- By Theorem 10, $\hat\theta_{\text{MLE}}$ is approximately $N\!\left(\theta, \tfrac{1}{n I(\theta)}\right)$ for large $n$.
- We know $I(\theta)$ from the step above, so we plug it into the variance.
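- A fully worked instance of this recipe, assuming a Poisson$(\theta)$ model for concreteness (the steps are the same for any regular model):

```latex
% Assumed model: X_1, ..., X_n iid Poisson(\theta), so \hat\theta_{\text{MLE}} = \bar{X}.
I(\theta) = -E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]
          = E\!\left[\frac{X}{\theta^2}\right] = \frac{1}{\theta}
% By Theorem 10:
\sqrt{n I(\theta)}\,(\hat\theta_{\text{MLE}} - \theta) \xrightarrow{d} N(0,1)
\;\Rightarrow\;
\hat\theta_{\text{MLE}} \approx N\!\left(\theta,\; \frac{\theta}{n}\right) \text{ for large } n
```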
- Bayesian Approach to Parameter Estimation
- Review from STA256, Bayes' Theorem
- For two events $A$ and $B$ with $P(B) > 0$, we have $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
- Apply this to parameter estimation:
- Let $\theta$ be a parameter of interest and $x$ be the observed data.
- We look at $\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$.
- Probability of a parameter given data.
- $\pi(\theta)$ is our prior belief about $\theta$ before seeing the data.
- Initial best guess.
- Can be based on previous studies, expert knowledge, or subjective belief.
- Can be informative (specific belief) or non-informative (vague belief).
- $f(x \mid \theta)$:
- Likelihood of the data given the parameter $\theta$.
- How well does the parameter explain the observed data?
- Same as the likelihood function in MLE.
- $m(x)$:
- Marginal likelihood or evidence.
- Normalizing constant to ensure $\pi(\theta \mid x)$ is a valid probability distribution.
- Calculated by integrating over all possible values of $\theta$:
- $m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$
- $\pi(\theta \mid x)$: Posterior distribution of $\theta$ given the observed data $x$.
- Prior: Past.
- Posterior: Present and future beliefs.
- Updated belief about after observing the data.
- Combines prior belief and likelihood of the observed data.
- Select a prior $\pi(\theta)$.
- Best initial guess / starting point.
- Determine the likelihood $f(x \mid \theta)$.
- Calculate it using the assumed distribution of the data.
- How well does the parameter explain the observed data?
- Notice:
- $m(x)$ doesn't involve $\theta$; there is no parameter in it.
- So treat it as a constant. Toss it.
- To get the marginal distribution of $x$, we integrate out $\theta$: $m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$.
- Substitute it into Bayes' Theorem.
- The denominator is often hard to integrate, and it only gives information about our data.
- It carries no information about the parameter $\theta$, so we can ignore it.
- $\propto$ means "proportional to".
- $\pi(\theta \mid x) = c \cdot f(x \mid \theta)\,\pi(\theta)$, where $c$ is a constant that doesn't depend on $\theta$.
- Examine the resulting form and compare it with a known distribution; matching it gives the form of the posterior distribution.
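- A minimal numeric sketch of this recipe (hypothetical coin-flip data with a flat prior; a grid sum stands in for the integral $m(x)$):

```python
import numpy as np

# Grid approximation of a posterior, illustrating "keep the kernel,
# toss the constants, normalize at the end".
# Hypothetical setup: x_i ~ Bernoulli(theta) with a flat prior on theta.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # assumed toy data
theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter

prior = np.ones_like(theta)                # pi(theta): flat prior
likelihood = theta**x.sum() * (1 - theta)**(len(x) - x.sum())
kernel = likelihood * prior                # pi(theta | x) up to a constant

# Dividing by the total plays the role of m(x): it turns the kernel
# into a valid (discretized) probability distribution.
posterior = kernel / kernel.sum()

print("posterior mean of theta:", (theta * posterior).sum())
```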
- Example:
- $\pi(\theta)$ is our prior.
- Apply the Bayesian approach to find the posterior distribution of $\theta$.
- We have the prior, so we just need the likelihood.
- Assuming independence, we can write the likelihood as a product of the individual likelihoods for each $X_i$.
- The first part is a constant (it involves only the data), so we can ignore it.
- $\theta$ has the prior given above.
- Its form is different from the one on the formula sheet.
- To find $\pi(\theta)$, do a change of variables:
- Throw away the constants; they involve only the data.
- Bring it together:
- Likelihood times prior.
- Group the constant terms together; they involve only the data.
- Define two new quantities that combine the prior parameters with the data terms.
- Because we're missing the normalizing term, we can only claim proportionality.
- Now divide by the normalizing constant to make it a valid distribution.
- Absorb the first part into the $\propto$.
- The result looks like a gamma distribution, so we can say the posterior is gamma with the updated parameters.
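- A worked sketch matching this outline, under assumed distributions (exponential data with mean $\theta$, and $1/\theta \sim \text{Gamma}(\alpha, \beta)$; these are illustrative choices, not necessarily the lecture's):

```latex
% Change of variables on the prior: if 1/\theta ~ Gamma(\alpha, \beta), then
% \pi(\theta) \propto \theta^{-\alpha-1} e^{-\beta/\theta}.
\pi(\theta \mid x)
\propto \underbrace{\theta^{-n} e^{-\sum_i x_i/\theta}}_{\text{likelihood}}
\cdot \underbrace{\theta^{-\alpha-1} e^{-\beta/\theta}}_{\text{prior}}
= \theta^{-(\alpha+n)-1}\, e^{-(\beta + \sum_i x_i)/\theta}
% Let a = \alpha + n and b = \beta + \sum_i x_i: this is the kernel of an
% inverse-gamma(a, b), i.e. 1/\theta \mid x ~ Gamma(a, b).
```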