1. Introduction
Testing whether a hypothesis is true or false by calculating the probability of an event over a long-running experiment is known as frequentist statistics.
- An experiment with an infinite number of trials would determine $p$ with absolute accuracy
- It's not practical to conduct an experiment with an infinite number of trials
- Deciding what counts as a sufficient number of trials is a challenge
- If we can determine the confidence of the estimated $p$, we can decide whether to
- accept the conclusion
- extend the experiment with more trials until it achieves sufficient confidence
- prior beliefs (for example, coins are usually fair and the coin used is not intentionally biased, therefore $p \approx 0.5$)
- play a significant role in shaping the outcome of a hypothesis test
- However, they ==can't== be incorporated into frequentist statistics
2. Bayesian Learning
Consider the coin-flip experiment: if you flip the coin $10$ times, there are 2 cases:
- 5 heads and 5 tails: more confidence that $p = 0.5$
- $x$ heads and $10 - x$ tails: now you have 2 options
	- frequentist statistics: neglect prior beliefs, rely only on the data
	- Bayesian learning: adjust your belief according to the observation
2.1. Bayes' Theorem
Bayes’ theorem
$$
P ( \theta | X ) = \frac { P ( X | \theta ) P ( \theta ) } { P ( X ) }
$$
- $P ( \theta )$ Prior probability: the probability of the hypothesis $\theta$ being true before seeing the evidence
- $P ( X | \theta )$ Likelihood: the conditional probability of the evidence given a hypothesis
- $P ( X )$ Evidence: a summation (or integral) of the likelihood over all possible hypotheses, weighted by their priors, i.e. $\sum_i P(X | \theta_i) P(\theta_i)$
- $P ( \theta | X )$ Posterior probability: the probability of the hypothesis after seeing the evidence
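A quick worked example (the hypotheses and the 90/10 prior split are assumptions for illustration): let $\theta_{fair}$ mean $P(\text{heads}) = 0.5$ and $\theta_{biased}$ mean $P(\text{heads}) = 0.9$, with priors $P(\theta_{fair}) = 0.9$ and $P(\theta_{biased}) = 0.1$. After observing a single head:
$$
P(\theta_{fair} | X) = \frac{0.5 \times 0.9}{0.5 \times 0.9 + 0.9 \times 0.1} = \frac{0.45}{0.54} \approx 0.83
$$
One head slightly erodes, but does not overturn, the prior belief that the coin is fair.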
2.2. Application 1: MAP
MAP (maximum a posteriori) is used to pick the most probable hypothesis using posterior probabilities:
$$
\begin{aligned} \theta_{MAP} & = \operatorname{argmax}_{\theta_i} P ( \theta_i | X ) \\ & = \operatorname{argmax}_{\theta_i} \left( \frac{P ( X | \theta_i ) P ( \theta_i )}{P ( X )} \right) \end{aligned}
$$
$P(X)$ is independent of $\theta$. Therefore, we can simplify $\theta_{MAP}$ to
$$
\theta_{MAP} = \operatorname{argmax}_{\theta_i} \left( P ( X | \theta_i ) P ( \theta_i ) \right)
$$
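A minimal sketch of MAP selection over a discrete hypothesis space; the three coin-bias hypotheses, their prior weights, and the flip sequence are assumptions for illustration:

```python
# Each hypothesis theta is a coin bias P(heads); values are priors P(theta_i).
priors = {0.3: 0.2, 0.5: 0.6, 0.8: 0.2}

def likelihood(theta, flips):
    """P(X | theta) for a sequence of coin flips ('H'/'T')."""
    p = 1.0
    for f in flips:
        p *= theta if f == "H" else (1.0 - theta)
    return p

flips = "HHTHHHHT"  # observed data X (assumed)
# theta_MAP = argmax_theta P(X | theta) P(theta); P(X) cancels out
theta_map = max(priors, key=lambda t: likelihood(t, flips) * priors[t])
print(theta_map)    # 0.5 here: the strong prior outweighs 6 heads in 8 flips
```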
2.3. How to Achieve the Goal
For the $P ( X | \theta _ { i } )$ part, use MLE (maximum likelihood estimation):
$$
\frac { d } { d \theta } \ln P (X | \theta ) = 0
$$
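For instance, applying this to the Binomial likelihood of Section 2.4 recovers the empirical frequency (the $\ln \binom{N}{k}$ term vanishes under the derivative):
$$
\frac{d}{d\theta} \left[ k \ln \theta + (N - k) \ln(1 - \theta) \right] = \frac{k}{\theta} - \frac{N - k}{1 - \theta} = 0 \quad \Rightarrow \quad \hat{\theta}_{MLE} = \frac{k}{N}
$$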
For the $P \left( \theta _ { i } \right)$ part: if the hypothesis space is continuous, there are infinitely many hypotheses.
So in practice, we use a tractable parametric form: the Beta prior distribution
$$
P ( \theta ) = \frac { \theta ^ { \alpha - 1 } ( 1 - \theta ) ^ { \beta - 1 } } { B ( \alpha , \beta ) }
$$
- $B ( \alpha , \beta )$ (the Beta function) acts as the normalizing constant
2.4. Example
As a Binomial likelihood example:
$$
P ( k , N | \theta ) = \binom{N}{k} \theta ^ { k } ( 1 - \theta ) ^ { N - k }
$$
$$
P ( \theta ) = \frac { \theta ^ { \alpha - 1 } ( 1 - \theta ) ^ { \beta - 1 } } { B ( \alpha , \beta ) }
$$
So the posterior is
$$
\begin{aligned} P ( \theta | N , k ) & = \frac { P ( N , k | \theta ) \times P ( \theta ) } { P ( N , k ) } \\ & = \frac { \binom{N}{k} } { B ( \alpha , \beta ) \times P ( N , k ) } \times \theta ^ { ( k + \alpha ) - 1 } ( 1 - \theta ) ^ { ( N + \beta - k ) - 1 } \end{aligned}
$$
and we can recognize it as a new Beta distribution with $\alpha_{new} = \alpha + k$ and $\beta_{new} = \beta + N - k$:
$$
P ( \theta | N , k ) = \frac { \theta ^ { \alpha_{new} - 1 } ( 1 - \theta ) ^ { \beta_{new} - 1 } } { B \left( \alpha_{new} , \beta_{new} \right) }
$$
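A minimal sketch of this conjugate update in Python (the prior pseudo-counts and the data are assumptions for illustration); note that no integral over $\theta$ ever needs to be computed:

```python
from scipy.stats import beta

alpha_prior, beta_prior = 2.0, 2.0   # assumed Beta prior pseudo-counts
N, k = 10, 7                         # assumed data: 7 heads in 10 flips

# Posterior is Beta(alpha + k, beta + N - k) by conjugacy.
alpha_new = alpha_prior + k
beta_new = beta_prior + N - k

posterior = beta(alpha_new, beta_new)
print(posterior.mean())              # (alpha + k) / (alpha + beta + N) = 9/14
```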
3. Bayes Optimal Classifier
Given a new instance $x$, the classification $h _ { M A P } ( x )$ given by the previously obtained hypothesis may not be the most probable classification.
e.g. 1
$$
P ( h_1 | D ) = 0.4 , \quad P ( h_2 | D ) = 0.3 , \quad P ( h_3 | D ) = 0.3
$$
New instance:
$$
h_1 ( x ) = + , \quad h_2 ( x ) = - , \quad h_3 ( x ) = -
$$
The Bayes optimal classifier tells us that we should find:
$$
\arg \max_{v_j \in V} \sum_{h_i \in H} P ( v_j | h_i ) P ( h_i | D )
$$
e.g. 1 (continued):
$$
\begin{aligned} & P ( h_1 | D ) = 0.4 , \quad P ( - | h_1 ) = 0 , \quad P ( + | h_1 ) = 1 \\ & P ( h_2 | D ) = 0.3 , \quad P ( - | h_2 ) = 1 , \quad P ( + | h_2 ) = 0 \\ & P ( h_3 | D ) = 0.3 , \quad P ( - | h_3 ) = 1 , \quad P ( + | h_3 ) = 0 \end{aligned}
$$
therefore
$$
\begin{aligned} \sum_{h_i \in H} P ( + | h_i ) P ( h_i | D ) & = 0.4 \\ \sum_{h_i \in H} P ( - | h_i ) P ( h_i | D ) & = 0.6 \end{aligned}
$$
so we choose $-$ as the classification, even though $h _ { M A P } = h_1$ predicts $+$
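A minimal sketch of the Bayes optimal classifier on this example; the dictionary names are hypothetical:

```python
# posteriors: P(h_i | D); votes[h][v]: P(v | h_i) for labels +/-
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
votes = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(labels=("+", "-")):
    # argmax_v sum_h P(v | h) P(h | D)
    return max(labels,
               key=lambda v: sum(votes[h][v] * posteriors[h]
                                 for h in posteriors))

print(bayes_optimal())   # '-' with weight 0.6, beating '+' with 0.4
```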
4. Naive Bayes Classifier
When to use:
- Large training set available
- Attributes are conditionally independent given the class
- Used in diagnosis and text classification
Naive Bayes assumption:
$$
P \left( a_1 , a_2 \ldots a_n | v_j \right) = \prod_i P \left( a_i | v_j \right)
$$
Bayes MAP classification:
$$
\begin{aligned} v_{MAP} & = \underset { v_j \in V } { \operatorname { argmax } } \frac { P \left( a_1 , a_2 \ldots a_n | v_j \right) P \left( v_j \right) } { P \left( a_1 , a_2 \ldots a_n \right) } \\ & = \underset { v_j \in V } { \operatorname { argmax } } P \left( a_1 , a_2 \ldots a_n | v_j \right) P \left( v_j \right) \end{aligned}
$$
Naive Bayes classifier:
$$
v_{NB} = \underset { v_j \in V } { \operatorname { argmax } } P \left( v_j \right) \prod_i P \left( a_i | v_j \right)
$$
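A minimal count-based sketch of $v_{NB}$ for categorical attributes; the toy weather-style dataset is an assumption for illustration (a real implementation would also smooth counts for unseen attribute values):

```python
from collections import Counter, defaultdict

# Rows are (attribute tuple, class label); data is assumed for illustration.
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
        (("rainy", "mild"), "yes")]

class_counts = Counter(label for _, label in data)
# attr_counts[(i, a, v)] = #examples of class v whose i-th attribute is a
attr_counts = defaultdict(int)
for attrs, label in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, a, label)] += 1

def v_nb(attrs):
    # argmax_v P(v_j) * prod_i P(a_i | v_j), with probabilities from counts
    def score(v):
        p = class_counts[v] / len(data)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, a, v)] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(v_nb(("sunny", "mild")))   # 'no': P(sunny | yes) = 0 vetoes 'yes'
```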
5. Bayesian Belief Networks
Improving on the Naive Bayes classifier:
- Attributes are independent $\to$ too restrictive
- So Bayes nets describe conditional independence among subsets of variables
Conditional independence: $X$ is conditionally independent of $Y$ given $Z$ when
$$
P ( X | Y , Z ) = P ( X | Z )
$$
e.g. $P ( \text { Thunder } | \text { Rain } , \text { Lightning } ) = P ( \text { Thunder } | \text { Lightning } )$
Bayes nets use conditional independence to justify
$$
\begin{aligned} P ( X , Y | Z ) & = P ( X | Y , Z ) P ( Y | Z ) \\ & = P ( X | Z ) P ( Y | Z ) \end{aligned}
$$
For a network over variables $Y_1, \ldots, Y_n$ (figure omitted), in general:
$$
P \left( y_1 , \ldots , y_n \right) = \prod_{i = 1}^{n} P \left( y_i | \text { Parents } \left( Y_i \right) \right)
$$
So the joint distribution is fully defined by the graph structure together with the local conditionals $P \left( y_i | \text { Parents } \left( Y _ { i } \right) \right)$
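A minimal sketch of this factorization using the Rain/Lightning/Thunder variables from the example above; all CPT numbers are assumptions for illustration:

```python
# Each node stores P(y_i = True | Parents(Y_i)); Rain -> Lightning -> Thunder.
p_rain = 0.2                                   # P(Rain)
p_lightning = {True: 0.1, False: 0.01}         # P(Lightning | Rain)
p_thunder = {True: 0.95, False: 0.001}         # P(Thunder | Lightning)

def joint(rain, lightning, thunder):
    # P(r, l, t) = P(r) * P(l | r) * P(t | l); Thunder depends on Rain
    # only through Lightning, matching the conditional independence above.
    p = p_rain if rain else 1 - p_rain
    p *= p_lightning[rain] if lightning else 1 - p_lightning[rain]
    p *= p_thunder[lightning] if thunder else 1 - p_thunder[lightning]
    return p

print(joint(True, True, True))   # 0.2 * 0.1 * 0.95 = 0.019
```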
Learning task:
- Graph structure might be known or unknown
- Training data might provide all or some variables
Case 1: structure known
- Similar to training a neural network with hidden units
- Using gradient ascent
	- to maximize $P ( D | h )$
- Using EM (Expectation Maximization)
	- to maximize $E [ \ln P ( D | h ) ]$
Case 2: structure unknown
- Use greedy search to add/subtract edges and nodes
- …… ==still in research==
6. Appendix: EM
When to use? When the data is only partially observable
6.1. Problem Definition
Given:
- Observed data $X = \{ x_1 , \ldots , x_m \}$
- Unobserved data $Z = \{ z_1 , \ldots , z_m \}$
- Probability distribution $P ( Y | h )$
	- $Y = \{ y_1 , \dots , y_m \}$ where $y_i = x_i \cup z_i$
	- $h$: the parameters
Determine:
$h$ that maximizes $E [ \ln P ( Y | h ) ]$
6.2. General Method
Use $h$ and $X$ to estimate $Z$: the complete-data log-likelihood depends on the unobserved $Z$, so replace it with its expectation given the current $h$ and the observed $X$, then pick the $h'$ that maximizes this expectation
$$
E [ \ln P ( Y | h ) ] \to E [ \ln P ( Y | h ^ { \prime } ) \mid h , X ]
$$
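A minimal EM sketch on a toy problem assumed for illustration: two coins with unknown biases, where each row of flips was produced by one hidden coin. The E-step estimates $Z$ (which coin) from the current $h$ and $X$; the M-step re-estimates $h$ to maximize the expected log-likelihood:

```python
# h = (theta_A, theta_B); Z = which coin produced each row (unobserved).
rows = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]  # (heads, flips)
theta = [0.6, 0.5]                                    # initial guess for h

for _ in range(20):
    # E-step: responsibilities P(coin c | row, theta) for each row
    stats = [[0.0, 0.0], [0.0, 0.0]]   # per coin: expected heads, flips
    for k, n in rows:
        like = [t**k * (1 - t)**(n - k) for t in theta]
        for c in range(2):
            r = like[c] / sum(like)
            stats[c][0] += r * k
            stats[c][1] += r * n
    # M-step: re-estimate each bias from its expected counts
    theta = [heads / flips for heads, flips in stats]

print(theta)   # the estimates separate toward roughly 0.80 and 0.52
```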