A Short Exposition on Bayesian Inference and Probability
John Stutz & Peter Cheeseman -- 01 June 1994
Bayes' theorem gives the rule for updating belief in a Hypothesis H (i.e.
the probability of H) given additional evidence E, and background
information (context) I:
p(H|E,I) = p(H|I)*p(E|H,I)/p(E|I) [Bayes Rule]
The left-hand term, p(H|E,I), is called the posterior probability, and it
gives the probability of the hypothesis H after considering the effect of
evidence E in context I. The p(H|I) term is just the prior probability of H
given I alone; that is, the belief in H before the evidence E is considered.
The term p(E|H,I) is called the likelihood, and it gives the probability of
the evidence assuming that the hypothesis H and background information I are true.
The last term, 1/p(E|I), is independent of H, and can be regarded as a
normalizing or scaling constant. The information I is a conjunction of (at
least) all of the other statements relevant to determining p(H|I) and
p(E|I).
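As a minimal numerical sketch of the rule above, the following Python computes a posterior for a hypothesis H against its negation. The function name `posterior` and all the numbers are illustrative assumptions, not part of the original exposition, and p(E|I) is expanded here by assuming that H and not-H are the only alternatives.

```python
def posterior(prior_h, lik_e_given_h, lik_e_given_not_h):
    """Return p(H|E,I) for a hypothesis H versus its negation not-H."""
    # p(E|I), expanded over the two alternatives H and not-H (an assumption)
    evidence = prior_h * lik_e_given_h + (1.0 - prior_h) * lik_e_given_not_h
    # Bayes' rule: prior times likelihood, scaled by 1/p(E|I)
    return prior_h * lik_e_given_h / evidence

# Hypothetical inputs: p(H|I) = 0.01, p(E|H,I) = 0.9, p(E|not-H,I) = 0.05
p = posterior(prior_h=0.01, lik_e_given_h=0.9, lik_e_given_not_h=0.05)
print(round(p, 4))
```

Even with a likelihood of 0.9, the small prior keeps the posterior well below one half in this example, which shows how the prior and the likelihood both enter the result.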
Note that all of these probabilities are conditional - they specify the
degree of our belief in some proposition(s) under the assumption that some
other propositions are true. We require that the conditioning propositions
include, at least implicitly, all of the information used to determine the
probability of the conditioned proposition(s). Failure to do so renders the
probability calculation vacuous, since any two such calculations may then
obtain different results. Thus probability is a relation between conditioned
hypothesis and conditioning information - it is meaningless to talk about
THE probability of a hypothesis without also giving the evidence that that
probability value is based on.
Bayes' theorem is a simple consequence of the Product Rule of probability.
The product rule gives the probability of the logical conjunction (and) of
two statements A and B, written as A,B:
p(A,B|I) = p(A|B,I)*p(B|I) = p(B|A,I)*p(A|I). [Product Rule]
Bayes' rule is derived by rearranging the terms in the above equality. Using
the product rule directly, we can extend Bayes' rule to multiple sequential
updates:
p(H|E1,E2,E3,I) = p(H|I)*p(E1,E2,E3|H,I) / p(E1,E2,E3|I)
                = p(H|I)*p(E1|H,I)*p(E2|E1,H,I)*p(E3|E2,E1,H,I) / [p(E1|I)*p(E2|E1,I)*p(E3|E2,E1,I)]
Note the difficulty here. As each new piece of evidence is factored into the
calculation, its effect is conditional on all the previously considered
evidence. This difficulty is usually overcome by making conditional
independence assumptions, such as:
p(E2|E1,I) = p(E2|I) and p(E1|E2,I) = p(E1|I).
In other words: given I, knowing that E2 is true tells us nothing about E1,
and vice versa. Thus E2 contains no information regarding E1 that is not
already present in I. Under conditional independence the product rule
reduces to:
p(E1,E2|I) = p(E1|I)*p(E2|I).
And when the pieces of evidence Ei are conditionally independent both under I
and under H,I, the multiple update version of Bayes' rule reduces to:
p(H|E1,E2,E3,...,I) = p(H|I)*p(E1|H,I)*p(E2|H,I)*p(E3|H,I)*... / [p(E1|I)*p(E2|I)*p(E3|I)*...].
This last equation greatly simplifies the problem of bringing evidence to
bear on a hypothesis. But it must be used with caution, because conditional
independence does not always hold. The seductive simplicity of this
conditionally independent version of Bayes' rule has led many analysts to
grief.
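The multiple update can be sketched in Python as follows. This is an illustration under stated assumptions: the hypotheses H1 and H2 are taken to be mutually exclusive and exhaustive, the pieces of evidence are assumed conditionally independent given each hypothesis, and the sketch normalizes at every step instead of dividing by the product of the p(Ei|I). The helper `update` and all numbers are hypothetical.

```python
def update(prior, likelihoods):
    """One Bayes update over mutually exclusive, exhaustive hypotheses.

    prior:       dict mapping hypothesis -> p(H|I)
    likelihoods: dict mapping hypothesis -> p(E|H,I)
    """
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnorm.values())  # normalizing constant p(E|I)
    return {h: v / evidence for h, v in unnorm.items()}

prior = {"H1": 0.5, "H2": 0.5}
# Hypothetical likelihoods p(Ei|H,I) for three pieces of evidence
evidence_likelihoods = [
    {"H1": 0.8, "H2": 0.3},
    {"H1": 0.6, "H2": 0.5},
    {"H1": 0.9, "H2": 0.2},
]

# Sequential form: each posterior becomes the prior for the next update
belief = dict(prior)
for lik in evidence_likelihoods:
    belief = update(belief, lik)

# Batch form: multiply the likelihoods first, then normalize once
joint = {h: 1.0 for h in prior}
for lik in evidence_likelihoods:
    for h in joint:
        joint[h] *= lik[h]
batch = update(prior, joint)

print(belief)
print(batch)
```

Both routes give the same posterior only because each likelihood was assumed not to depend on the earlier evidence. If that assumption fails, the per-evidence likelihoods must be conditioned on the evidence already used.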
The second fundamental rule of probability theory is the Sum rule. The Sum
rule gives the probability of the logical disjunction (or) of two statements
A and B, written as A+B:
p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I) [General Sum Rule].
The Sum and Product rules are the primary mathematical consequences of our
desire that Probability Theory be consistent with Aristotelian Logic. Thus
as belief goes to the extreme limits of truth or falsity, the Probability
calculus reverts to the Predicate calculus. Jaynes gives an excellent
development showing how these rules can be derived from basic desiderata of
rational belief.
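The reduction to ordinary logic can be checked in a small Python sketch. It is illustrative only: the function name `p_or` is an assumption, and the check covers just the four cases in which A and B are each certainly true or certainly false.

```python
from itertools import product

def p_or(p_a, p_b, p_ab):
    """General sum rule: p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I)."""
    return p_a + p_b - p_ab

# At the extremes every probability is 0 or 1. The product rule then gives
# p(A,B|I) = 1 only when both statements are certain, i.e. logical AND,
# and the sum rule reproduces logical OR.
for a, b in product([0, 1], repeat=2):
    p_ab = a * b
    assert p_ab == (a and b)
    assert p_or(a, b, p_ab) == (a or b)

# A non-extreme case with hypothetical values
print(p_or(0.5, 0.4, 0.2))
```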
A set of n hypotheses Hi is said to be Mutually Exclusive w.r.t. I if
p(Hi,Hj|I) = 0 for i != j.
This holds when I is such that no two of the Hj can be simultaneously true.
It is equivalent to saying:
p(Hi|Hj,I) = p(Hj|Hi,I) = 0 for i != j.
Under exclusiveness the sum rule reduces to:
p(H1+H2|I) = p(H1|I)+p(H2|I).
A set of hypotheses is said to be Exhaustive w.r.t. I if:
p(H1+H2+...+Hn|I) = 1.
This holds when I is such that at least one of the Hj must be true.
Alternatively one can state this as (H1+H2+...+Hn|I) = T.
When a set of Hj is both Mutually Exclusive and Exhaustive given information
I, we can use the sum rule above to get rid of a proposition by summing over
all the possibilities. This proposition elimination by summing is called
marginalization:
p(E,H1|I) + p(E,H2|I) +...+ p(E,Hn|I)
  = p(H1|E,I)p(E|I) + p(H2|E,I)p(E|I) +...+ p(Hn|E,I)p(E|I)    [product rule]
  = (p(H1|E,I) + p(H2|E,I) +...+ p(Hn|E,I)) p(E|I)             [grouped sum]
  = p(H1 + H2 +...+ Hn|E,I) p(E|I)                             [exclusiveness]
  = p(T|E,I) p(E|I)                                            [exhaustiveness]
  = p(E|I)
which gives the denominator for Bayes' rule.
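In practice each joint term p(E,Hi|I) is usually computed with the other factorization of the product rule, p(E|Hi,I)*p(Hi|I), since those are the quantities one normally has. A minimal Python sketch, with hypothetical priors and likelihoods for three mutually exclusive and exhaustive hypotheses:

```python
# Hypothetical values; the priors must sum to 1 for exhaustiveness
priors = {"H1": 0.2, "H2": 0.5, "H3": 0.3}        # p(Hi|I)
likelihoods = {"H1": 0.9, "H2": 0.4, "H3": 0.1}   # p(E|Hi,I)

# Joint terms p(E,Hi|I) = p(E|Hi,I) * p(Hi|I)
joint = {h: likelihoods[h] * priors[h] for h in priors}

# Marginalization: summing the joint terms over all Hi leaves p(E|I)
p_e = sum(joint.values())

# Dividing each joint term by p(E|I) gives the posteriors p(Hi|E,I)
posteriors = {h: joint[h] / p_e for h in joint}

print(p_e)
print(posteriors)
```

The posteriors sum to one by construction, which is the sense in which 1/p(E|I) acts as a normalizing constant.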
Marginalization is also a powerful technique for accounting for the effects
of nuisance parameters. These are parameters which clearly affect the
probability of the evidence under a hypothesis, but which are of no interest
in the current calculation. By marginalizing over the mutually exclusive and
exhaustive set of possible parameter values, we can account for their effect
while eliminating them from the final result.
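A sketch of marginalizing over a nuisance parameter, under assumptions that are not in the original text: the evidence is k heads in n coin flips, the hypothesis H says only that the coin has some unknown bias theta, and theta is restricted to a discrete grid with a uniform prior so that its values form a mutually exclusive and exhaustive set. The function names are hypothetical.

```python
from math import comb

def binom_lik(k, n, theta):
    """p(E|theta,H,I): probability of k heads in n flips with bias theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def marginal_likelihood(k, n, thetas, weights):
    """p(E|H,I) = sum over j of p(E|theta_j,H,I) * p(theta_j|H,I)."""
    return sum(binom_lik(k, n, t) * w for t, w in zip(thetas, weights))

thetas = [i / 10 for i in range(11)]         # grid 0.0, 0.1, ..., 1.0
weights = [1 / len(thetas)] * len(thetas)    # uniform p(theta_j|H,I), an assumption

k, n = 8, 10
p_e_unknown_bias = marginal_likelihood(k, n, thetas, weights)
p_e_fair = binom_lik(k, n, 0.5)              # rival hypothesis: theta is exactly 0.5

print(p_e_unknown_bias, p_e_fair)
```

The bias theta affects the probability of the evidence but does not appear in `p_e_unknown_bias`: it has been summed out, leaving a likelihood that depends only on the hypothesis.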
The sum and product rules, conditional independence, and marginalization
provide the basic tools for Bayesian belief updating. But in order to apply
these tools we must first have the likelihood probabilities p(E|H,I) of the
evidence under each hypothesis and the prior probabilities p(H|I) of the
hypotheses, independent of the evidence. The likelihoods are fairly
straightforward, since they come from knowledge about the domain -- i.e. if
you knew that H described the true state of the world, what would you expect
to see? Their specification has been a major preoccupation of statisticians
for two centuries and there is a vast literature describing such functions
and their applicability. They are a principal subject of any statistics
course, and will not be further discussed here.
Prior probabilities are another matter, having been not merely ignored by
statisticians, but abhorred. This is undoubtedly because, until recently,
there was little agreement on how to specify such probabilities
consistently, and because prior probabilities are subjective in nature.
Recent work on the testability of prior information, invariance constraints
on certain types of prior information, and maximum entropy arguments have
done much to correct this situation. Minimum information priors for most of
the standard parameter types are now agreed upon, but informative priors are
still an area of active research. In practice, for any real problem, there
is sufficient domain knowledge to specify weak normalizable prior
probabilities. If the posterior probability is still strongly dependent on
the assumed prior, then more evidence should be obtained, or the prior
probabilities should be improved by careful consideration of the domain.
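The dependence of the posterior on the assumed prior can be probed with a short sensitivity check. The Python below is a sketch with hypothetical numbers: it compares the posterior for H versus not-H under three different priors, once for weak evidence and once for strong evidence.

```python
def posterior_h(prior_h, lik_h, lik_not_h):
    """p(H|E,I) for H versus not-H, normalizing over the two alternatives."""
    num = prior_h * lik_h
    return num / (num + (1.0 - prior_h) * lik_not_h)

priors = (0.2, 0.5, 0.8)              # three hypothetical choices of p(H|I)
cases = {
    "weak evidence": (0.6, 0.5),      # p(E|H,I), p(E|not-H,I)
    "strong evidence": (0.99, 0.001),
}

spread = {}
for label, (lik_h, lik_not_h) in cases.items():
    posts = [posterior_h(p, lik_h, lik_not_h) for p in priors]
    spread[label] = max(posts) - min(posts)   # range of posteriors across priors
    print(label, [round(x, 3) for x in posts])
```

With the weak evidence the posterior largely tracks whichever prior was assumed. With the strong evidence the three posteriors nearly coincide, so the choice of prior matters little.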
Bayes' theorem is only part of Bayesian inference as developed by Laplace,
Jeffreys, Jaynes, and successors. Other probability concepts, such as the
product and sum rules, conditional independence, and the technique of
marginalization, are also necessary. None of these concepts can
be used until one can extract prior probabilities from a problem statement;
hence use of prior probabilities is one of the distinguishing
characteristics of Bayesian inference.
Bayesian Learning Group