Probability: definitions and interpretations
I will be a little formal [1] for a moment here as we construct this mathematical notion of probability. First, we need to define the world of possibilities. We denote by $\Omega$ the sample space, the set of all possible outcomes we could observe.
We define the probability of event $A \subseteq \Omega$ by a function $P(A)$ satisfying the following axioms.

1. The probability must be nonnegative; $P(A) \ge 0$ for all $A$.
2. The probability that an event was drawn from the entire sample space is one; $P(\Omega) = 1$.
3. The probability of the empty set is zero; $P(\emptyset) = 0$. Along with the previous axiom and the requirement that $P(A)$ range from zero to one, this essentially says that only events in the sample space are allowable outcomes.
4. If $A_1, A_2, \ldots$ are disjoint events, then

\[
P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots.
\]

This means that probability is additive. The probability of observing an event in the union of disjoint events is the sum of the probabilities of those events.
Putting together these axioms, we see that probability consists of nonnegative real numbers that are distributed among the events of a sample space. The sum total of these real numbers over the entire sample space is one. So, a probability function and a sample space go hand-in-hand. For many of our applications, the sample space consists of sets of numbers, such as the real numbers, the integers, or subsets thereof.
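To make the axioms concrete, here is a minimal Python sketch that checks them on a finite sample space. The six-sided die and its equally likely outcomes are a made-up example, not anything defined above:

```python
import math

# Hypothetical sample space: the faces of a fair six-sided die
omega = set(range(1, 7))

def P(event):
    """Probability of an event (a subset of omega), with all outcomes equally likely."""
    return len(event & omega) / len(omega)

A, B = {1, 2}, {5}                          # two disjoint events
assert P(omega) == 1                        # P(Ω) = 1
assert P(set()) == 0                        # P(∅) = 0
assert math.isclose(P(A | B), P(A) + P(B))  # additivity for disjoint events
```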
Interpretations of probability
Before we go on to talk more about probability, it will help to be thinking about how we can apply it to understand measured data. To do that, we need to think about how probability is interpreted. Note that these are interpretations of probability, not definitions. We have already defined probability, and both of the dominant interpretations below are valid.
Frequentist probability.
In the frequentist interpretation of probability, the probability $P(A)$ represents the long-run frequency with which event $A$ occurs over a large number of identical repetitions of an experiment. These repetitions can be, and often are, hypothetical.
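As a quick illustration of this interpretation, here is a small simulation sketch using NumPy. The fair-coin setup, the seed, and the number of flips are all assumptions made for illustration:

```python
import numpy as np

# Simulated coin flips (assumed setup): True means heads, with P(heads) = 0.5
rng = np.random.default_rng(seed=3252)
n = 100_000
flips = rng.random(n) < 0.5

# Running frequency of heads after each flip
freq = np.cumsum(flips) / np.arange(1, n + 1)

# The running frequency settles toward 0.5 as the number of repetitions grows
print(freq[9], freq[999], freq[-1])
```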
Bayesian probability.
Here, the probability $P(A)$ describes the plausibility of event $A$: it quantifies the degree of belief, given all available information, that $A$ is true. In this interpretation, $A$ can be any logical proposition, including statements about parameter values.
You may have heard about a split, or even a fight, between people who use Bayesian and frequentist interpretations of probability applied to statistical inference. There is no need for a fight. The two ways of approaching statistical inference differ in their interpretation of probability, the tool we use to quantify uncertainty. Both are valid.
In my opinion, the Bayesian interpretation of probability is more intuitive to apply to scientific inference. It always starts with a simple probabilistic expression and proceeds to quantify plausibility. It is conceptually cleaner to me, since we can talk about the plausibility of anything, including parameter values. In other words, Bayesian probability serves to quantify our own knowledge, or degree of certainty, about a hypothesis or parameter value. Conversely, in frequentist statistical inference, the parameter values are fixed (they are not random variables; they cannot vary meaningfully from experiment to experiment), and we can only study how repeated experiments transform the true parameter values into observations.
That is my opinion, and I view fights over such things as counterproductive. Frequentist methods are also very useful and powerful, and in this class, we will almost exclusively use them. Next term, we will use almost exclusively Bayesian methods.
The sum rule, the product rule, and conditional probability
The sum rule, which may be derived from the axioms defining probability, says that the probability of all events must add to unity. Let $A^c$ denote the complement of $A$, that is, every event except $A$. Then the sum rule states that

\[
P(A) + P(A^c) = 1.
\]

Now, let's say that we are interested in event $A$, given that event $B$ happened. The probability of $A$ conditioned on $B$ having occurred is the conditional probability, written $P(A \mid B)$. In terms of conditional probabilities, the sum rule reads

\[
P(A \mid B) + P(A^c \mid B) = 1 \quad \text{for any } B.
\]
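As a numerical check of the conditional form of the sum rule, consider a small joint distribution over two binary events. The numbers are made up for illustration, and the conditional probabilities are computed as joint over marginal, which is the product rule stated next:

```python
import numpy as np

# Made-up joint distribution: p[i, j] = P(A = i, B = j), where 1 means "happened"
p = np.array([[0.42, 0.28],   # A did not happen (A^c)
              [0.18, 0.12]])  # A happened

P_B = p[:, 1].sum()           # P(B), summing over both cases for A
P_A_given_B = p[1, 1] / P_B   # P(A | B)
P_Ac_given_B = p[0, 1] / P_B  # P(A^c | B)

# The conditional probabilities sum to one, as the sum rule requires
print(P_A_given_B + P_Ac_given_B)
```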
The product rule states that

\[
P(A, B) = P(A \mid B)\, P(B),
\]

where $P(A, B)$ is the joint probability of both $A$ and $B$ happening. Like the sum rule, the product rule also holds when everything is conditioned on a third event:

\[
P(A, B \mid C) = P(A \mid B, C)\, P(B \mid C) \quad \text{for any } C.
\]
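The same made-up table gives a direct check of the product rule; the check simply unwinds the definition of the conditional probability, confirming that the rules are consistent:

```python
import numpy as np

# Same made-up joint distribution as above: p[i, j] = P(A = i, B = j)
p = np.array([[0.42, 0.28],
              [0.18, 0.12]])

P_B = p[:, 1].sum()           # P(B)
P_A_given_B = p[1, 1] / P_B   # P(A | B)

# Product rule: P(A, B) = P(A | B) P(B) recovers the joint entry
print(np.isclose(p[1, 1], P_A_given_B * P_B))   # True
```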
Bayes's theorem
Note that because “and” is commutative, $P(A, B) = P(B, A)$. Applying the product rule to both sides, we have

\[
P(A \mid B)\, P(B) = P(A, B) = P(B, A) = P(B \mid A)\, P(A).
\]

If we take the terms at the beginning and end of this equality and rearrange, we get

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
\]
This result is called Bayes's theorem. It holds for probability regardless of how probability is interpreted: frequentist, Bayesian, or otherwise.
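As a sanity check, here is Bayes's theorem applied to made-up numbers; none of these values come from the text:

```python
# Assumed (made-up) probabilities
P_A = 0.2            # P(A)
P_B_given_A = 0.9    # P(B | A)
P_B_given_Ac = 0.15  # P(B | A^c)

# P(B) via the sum and product rules
P_B = P_B_given_A * P_A + P_B_given_Ac * (1 - P_A)

# Bayes's theorem: P(A | B) = P(B | A) P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B
print(P_A_given_B)   # approximately 0.6
```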
Marginalization
Let $B_1, B_2, \ldots$ be a set of mutually exclusive, collectively exhaustive events; that is, the $B_i$ are disjoint and together make up the entire sample space. By the sum rule,

\[
\sum_i P(B_i \mid A) = 1 \quad \text{for any } A.
\]

Now, Bayes's theorem gives us an expression for $P(B_i \mid A)$:

\[
P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{P(A)}.
\]

Therefore, we have

\[
\sum_i \frac{P(A \mid B_i)\, P(B_i)}{P(A)} = 1,
\]

which we can rearrange to give

\[
P(A) = \sum_i P(A \mid B_i)\, P(B_i).
\]

Using the definition of conditional probability, we also have

\[
P(A) = \sum_i P(A, B_i).
\]

This process of eliminating a variable (in this case the $B_i$) from a joint probability by summing over it is called marginalization.
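Computationally, marginalization amounts to summing a joint distribution over the variable to be eliminated. A minimal sketch, with a made-up joint distribution:

```python
import numpy as np

# Made-up joint distribution: p[i, j] = P(A = i, B_j), where the columns
# index three mutually exclusive, collectively exhaustive events B_1, B_2, B_3
p = np.array([[0.10, 0.25, 0.15],
              [0.20, 0.05, 0.25]])

# P(A) = sum over i of P(A, B_i): sum out the B_i along the columns
P_A = p.sum(axis=1)
print(P_A)   # [0.5 0.5]; the B_i have been marginalized away
```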