A Short Exposition on Bayesian Inference and Probability

 

John Stutz & Peter Cheeseman -- 01 June 1994

 

Bayes' theorem gives the rule for updating belief in a Hypothesis H (i.e.

the probability of H) given additional evidence E, and background

information (context) I:

 

p(H|E,I) = p(H|I)*p(E|H,I)/p(E|I) [Bayes Rule]

 

The left-hand term, p(H|E,I), is called the posterior probability, and it

gives the probability of the hypothesis H after considering the effect of

evidence E in context I. The p(H|I) term is just the prior probability of H

given I alone; that is, the belief in H before the evidence E is considered.

The term p(E|H,I) is called the likelihood, and it gives the probability of

the evidence assuming the hypothesis H and background information I are true.

The last term, 1/p(E|I), is independent of H, and can be regarded as a

normalizing or scaling constant. The information I is a conjunction of (at

least) all of the other statements relevant to determining p(H|I) and

p(E|I).
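
As a rough illustration of Bayes' rule, the sketch below updates belief in a single hypothesis H against its negation. The function name and the numbers are hypothetical, and the sketch assumes that p(E|I) can be obtained by adding the contributions from H and from not-H.

```python
def posterior(prior_h, like_e_given_h, like_e_given_not_h):
    """Return p(H|E,I) for a hypothesis H and its negation (illustrative helper)."""
    # p(E|I), assuming H and not-H are the only two possibilities given I
    evidence = prior_h * like_e_given_h + (1.0 - prior_h) * like_e_given_not_h
    # Bayes' rule: prior * likelihood / p(E|I)
    return prior_h * like_e_given_h / evidence


# Hypothetical numbers: a rare hypothesis and fairly diagnostic evidence
p = posterior(prior_h=0.01, like_e_given_h=0.9, like_e_given_not_h=0.05)
print(round(p, 4))  # prints 0.1538
```

Even with evidence that is eighteen times more probable under H than under not-H, the small prior keeps the posterior near 0.15 in this example.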

 

Note that all of these probabilities are conditional - they specify the

degree of our belief in some proposition(s) under the assumption that some

other propositions are true. We require that the conditioning propositions

include, at least implicitly, all of the information used to determine the

probability of the conditioned proposition(s). Failure to do so renders the

probability calculation vacuous, since any two such calculations may then

obtain different results. Thus probability is a relation between conditioned

hypothesis and conditioning information - it is meaningless to talk about

THE probability of a hypothesis without also giving the evidence that that

probability value is based on.

 

Bayes' theorem is a simple consequence of the Product Rule of probability.

The product rule gives the probability of the logical conjunction (and) of

two statements A and B, written as A,B:

 

p(A,B|I) = p(A|B,I)*p(B|I) = p(B|A,I)*p(A|I). [Product Rule]

 

Bayes' rule is derived by rearranging the terms in the above equality. Using

the product rule directly, we can extend Bayes' rule to multiple sequential

updates:

 

p(H|E1,E2,E3,I) = p(H|I)*p(E1,E2,E3|H,I)/p(E1,E2,E3|I)

 

= p(H|I)*p(E1|H,I)*p(E2|E1,H,I)*p(E3|E2,E1,H,I) / [p(E1|I)*p(E2|E1,I)*p(E3|E2,E1,I)]
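
The chain-rule expansion above can be checked numerically. The sketch below uses a small, made-up joint distribution over H and two pieces of evidence (two rather than three, to keep the table short). It compares the posterior computed in one step with the posterior obtained by updating on E1 and then on E2. The table and the helper `p` are assumptions for illustration only.

```python
# Hypothetical joint distribution p(H, E1, E2 | I); keys are (H, E1, E2)
# truth values encoded as 1/0, and the entries sum to 1.
joint = {
    (1, 1, 1): 0.20, (1, 1, 0): 0.10, (1, 0, 1): 0.05, (1, 0, 0): 0.05,
    (0, 1, 1): 0.05, (0, 1, 0): 0.15, (0, 0, 1): 0.10, (0, 0, 0): 0.30,
}


def p(h=None, e1=None, e2=None):
    """Probability of the given values; an argument left as None is summed out."""
    return sum(
        v for (a, b, c), v in joint.items()
        if (h is None or a == h)
        and (e1 is None or b == e1)
        and (e2 is None or c == e2)
    )


# One-step posterior p(H|E1,E2,I)
direct = p(h=1, e1=1, e2=1) / p(e1=1, e2=1)

# Sequential updates: first on E1, then on E2 conditional on E1
after_e1 = p(h=1) * (p(h=1, e1=1) / p(h=1)) / p(e1=1)      # p(H|E1,I)
like_e2 = p(h=1, e1=1, e2=1) / p(h=1, e1=1)                # p(E2|E1,H,I)
norm_e2 = p(e1=1, e2=1) / p(e1=1)                          # p(E2|E1,I)
after_e2 = after_e1 * like_e2 / norm_e2                    # p(H|E1,E2,I)

print(round(direct, 6), round(after_e2, 6))
```

Both routes give the same posterior because no independence assumption has been made: each factor is conditioned on all evidence considered before it.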

 

Note the difficulty here. As each new piece of evidence is factored into the

calculation, its effect is conditional on all the previously considered

evidence. This difficulty is usually overcome by making conditional

independence assumptions, such as:

 

p(E2|E1,I) = p(E2|I) and p(E1|E2,I) = p(E1|I).

 

In other words: given I, knowing that E2 is true tells us nothing about E1,

and vice versa. Thus E2 contains no information regarding E1 that is not

already present in I. Under conditional independence the product rule

reduces to:

 

p(E1,E2|I) = p(E1|I)*p(E2|I).

 

And when the pieces of evidence Ei are conditionally independent both under I

and under H,I, the multiple update version of Bayes' rule reduces to:

 

p(H|E1,E2,E3,...,I) = p(H|I)*p(E1|H,I)*p(E2|H,I)*p(E3|H,I)*... / [p(E1|I)*p(E2|I)*p(E3|I)*...].

 

This last equation greatly simplifies the problem of bringing evidence to

bear on a hypothesis. But it must be used with caution, because conditional

independence does not always hold. The seductive simplicity of this

conditionally independent version of Bayes' rule has led many analysts to

grief.
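
A minimal sketch of how the simplified rule can go wrong, using hypothetical numbers. Here E1 and E2 are assumed independent given H and given not-H, but they are then not independent given I alone, so dividing by p(E1|I)*p(E2|I) does not normalize correctly.

```python
prior_h = 0.5                 # p(H|I), hypothetical
like_h = [0.8, 0.8]           # p(E1|H,I), p(E2|H,I)
like_not_h = [0.2, 0.2]       # p(E1|not-H,I), p(E2|not-H,I)

# Exact posterior: normalize over H and not-H, using independence given each
num_h = prior_h * like_h[0] * like_h[1]
num_not_h = (1.0 - prior_h) * like_not_h[0] * like_not_h[1]
exact = num_h / (num_h + num_not_h)

# Simplified rule: divide by the product of the marginals p(Ei|I)
p_e = [prior_h * lh + (1.0 - prior_h) * ln for lh, ln in zip(like_h, like_not_h)]
simplified = num_h / (p_e[0] * p_e[1])

print(round(exact, 4))       # prints 0.9412
print(round(simplified, 4))  # prints 1.28
```

The simplified rule returns a value greater than 1, which is not a probability at all. The evidence is independent under H,I but not under I, so only the numerator of the simplified rule is justified in this example.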

 

The second fundamental rule of probability theory is the Sum rule. The Sum

rule gives the probability of the logical disjunction (or) of two statements

A and B, written as A+B:

 

p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I) [General Sum Rule].
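
The general sum rule, and the product rule it relies on, can be verified on a small example. The joint distribution below is made up for illustration; the variable names are not from any library.

```python
# Hypothetical joint distribution p(A,B|I) over the truth values of A and B
joint = {
    (True, True): 0.2,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.4,
}

p_a = sum(v for (a, b), v in joint.items() if a)    # p(A|I)
p_b = sum(v for (a, b), v in joint.items() if b)    # p(B|I)
p_ab = joint[(True, True)]                          # p(A,B|I)

# Disjunction computed directly and through the general sum rule
p_or_direct = sum(v for (a, b), v in joint.items() if a or b)
p_or_sum_rule = p_a + p_b - p_ab

# Product rule: both factorizations recover p(A,B|I)
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

print(round(p_or_direct, 6), round(p_or_sum_rule, 6))
print(round(p_a_given_b * p_b, 6), round(p_b_given_a * p_a, 6))
```

Subtracting p(A,B|I) corrects for the case where both statements are true, which would otherwise be counted twice.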

 

The Sum and Product rules are the primary mathematical consequences of our

desire that Probability Theory be consistent with Aristotelian Logic. Thus

as belief goes to the extreme limits of truth or falsity, the Probability

calculus reverts to the Predicate calculus. Jaynes gives an excellent

development showing how these rules can be derived from basic desiderata of

rational belief.

 

A set of n hypotheses is said to be Mutually Exclusive w.r.t. I if:

 

p(Hi,Hj|I) = 0 for i != j.

 

This holds when I is such that no two of the Hj can be simultaneously true.

It is equivalent to saying:

 

p(Hi|Hj,I) = p(Hj|Hi,I) = 0 for i != j.

 

Under exclusiveness the sum rule reduces to:

 

p(H1+H2|I) = p(H1|I)+p(H2|I).

 

A set of hypotheses is said to be Exhaustive w.r.t. I if:

 

p(H1+H2+...+Hn|I) = 1.

 

This holds when I is such that at least one of the Hj must be true.

Alternatively one can state this as (H1+H2+...+Hn|I) = T.

 

When a set of Hj is both Mutually Exclusive and Exhaustive given information

I, we can use the sum rule above to get rid of a proposition by summing over

all the possibilities. This proposition elimination by summing is called

marginalization:

 

p(E,H1|I) + p(E,H2|I) +...+ p(E,Hn|I)

= p(H1|E,I)p(E|I) + p(H2|E,I)p(E|I) +...+ p(Hn|E,I)p(E|I) [product rule]

= (p(H1|E,I) + p(H2|E,I) +...+ p(Hn|E,I)) p(E|I) [grouped sum]

= p(H1 + H2 +...+ Hn|E,I) p(E|I) [exclusiveness]

= p(T|E,I) p(E|I) [exhaustiveness]

= p(E|I)

 

which gives the denominator for Bayes' rule.
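
In practice the denominator is usually computed this way, by writing each joint term as p(E,Hj|I) = p(E|Hj,I)*p(Hj|I) and summing. The sketch below assumes three mutually exclusive and exhaustive hypotheses with made-up priors and likelihoods; the function name is illustrative.

```python
def posteriors(priors, likelihoods):
    """Return ([p(Hj|E,I)], p(E|I)) for mutually exclusive, exhaustive Hj."""
    # Product rule: p(E,Hj|I) = p(E|Hj,I) * p(Hj|I)
    joint = [prior * like for prior, like in zip(priors, likelihoods)]
    # Marginalization: p(E|I) is the sum of the joint terms
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


priors = [0.5, 0.3, 0.2]        # hypothetical p(Hj|I), summing to 1
likelihoods = [0.1, 0.4, 0.7]   # hypothetical p(E|Hj,I)
post, p_e = posteriors(priors, likelihoods)

print(round(p_e, 4))                   # prints 0.31
print([round(x, 4) for x in post])
```

Because the same p(E|I) divides every term, the posteriors sum to 1 by construction.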

 

Marginalization is also a powerful technique for accounting for the effects

of nuisance parameters. These are parameters which clearly affect the

probability of the evidence under a hypothesis, but which are of no interest

in the current calculation. By marginalizing over the mutually exclusive and

exhaustive set of possible parameter values, we can account for their effect

while eliminating them from the final result.
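
A small sketch of marginalizing a nuisance parameter, under stated assumptions: the parameter takes one of two labelled values, its prior is the same under every hypothesis, and all numbers are hypothetical.

```python
# Hypothetical p(E | theta, H, I) for two hypotheses and a two-valued nuisance parameter
like = {
    "H1": {"low": 0.9, "high": 0.6},
    "H2": {"low": 0.2, "high": 0.5},
}
theta_prior = {"low": 0.7, "high": 0.3}   # p(theta|H,I), assumed equal for H1 and H2
h_prior = {"H1": 0.5, "H2": 0.5}          # p(H|I)

# Marginalize theta: p(E|H,I) = sum over theta of p(E|theta,H,I) * p(theta|H,I)
marginal_like = {
    h: sum(like[h][t] * theta_prior[t] for t in theta_prior)
    for h in like
}

# Bayes' rule on the hypotheses alone; theta no longer appears
evidence = sum(marginal_like[h] * h_prior[h] for h in h_prior)
post = {h: marginal_like[h] * h_prior[h] / evidence for h in h_prior}

print({h: round(v, 4) for h, v in marginal_like.items()})
print({h: round(v, 4) for h, v in post.items()})
```

The posterior over H1 and H2 reflects the uncertainty in the nuisance parameter without ever reporting a value for it.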

 

The sum and product rules, conditional independence, and marginalization

provide the basic tools for Bayesian belief updating. But in order to apply

these tools we must first have the likelihood probabilities p(E|H,I) of the

evidence under each hypothesis and the prior probabilities p(H|I) of the

hypotheses independent of the evidence. The likelihoods are fairly

straightforward, since they come from knowledge about the domain -- i.e. if

you knew that H described the true state of the world, what would you expect

to see? Their specification has been a major preoccupation of statisticians

for two centuries and there is a vast literature describing such functions

and their applicability. They are a principal subject of any statistics

course, and will not be further discussed here.

 

Prior probabilities are another matter, having been not merely ignored by

statisticians, but abhorred. This is undoubtedly due to the fact that, until

recently, there has been little agreement on how to consistently specify

such probabilities, and to the subjective nature of the prior probabilities.

Recent work on the testability of prior information, invariance constraints

on certain types of prior information, and maximum entropy arguments have

done much to correct this situation. Minimum information priors for most of

the standard parameter types are now agreed upon, but informative priors are

still an area of active research. In practice, for any real problem, there

is sufficient domain knowledge to specify weak normalizable prior

probabilities. If the posterior probability is still strongly dependent on

the assumed prior, then more evidence should be obtained, or the prior

probabilities should be improved based on careful consideration of the domain.
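
One way to see this dependence on the prior is to compare posteriors under two different priors as the evidence grows. The sketch below assumes a Beta prior on an unknown success rate with binomial data, a standard conjugate model chosen here only for illustration; the text itself does not prescribe it, and the counts are made up.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of a success rate under a Beta(a, b) prior.

    Conjugate update: Beta(a, b) with k successes in n trials gives
    Beta(a + k, b + n - k), whose mean is (a + k) / (a + b + n).
    """
    return (a + successes) / (a + b + trials)


flat_prior = (1, 1)      # uniform prior
skewed_prior = (5, 1)    # prior that favours high rates

# Small data set: 7 successes in 10 trials
gap_small = abs(posterior_mean(*flat_prior, 7, 10) - posterior_mean(*skewed_prior, 7, 10))

# Large data set with the same observed rate: 700 successes in 1000 trials
gap_large = abs(posterior_mean(*flat_prior, 700, 1000) - posterior_mean(*skewed_prior, 700, 1000))

print(round(gap_small, 4), round(gap_large, 4))
```

With ten observations the two priors give noticeably different answers; with a thousand the difference is small, which is the situation in which the choice of prior no longer matters much.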

 

Bayes' theorem is only part of Bayesian inference as developed by Laplace,

Jeffreys, Jaynes, and successors. Other probability concepts, such as the

product and sum rules, the concept of conditional independence, and the

technique of marginalization, are also necessary. None of these concepts can

be used until one can extract prior probabilities from a problem statement;

hence use of prior probabilities is one of the distinguishing

characteristics of Bayesian inference.

 

Bayesian Learning Group
