Probability: The foundation for generative modeling

Any data set we encounter was generated by some process, usually a process that involved the ingenuity, blood, sweat, and tears of an experimenter. If we want to learn something more general about nature from acquired data, we need to have a model for the data generation process. Rob Phillips said it beautifully in his book Physical Biology of the Cell: “Quantitative data demand quantitative models.” We call models that describe the process of generating data generative models.

We will see in the following lessons that building generative models requires the mathematical machinery of probability. Specifically, we model data generation with generative probability distributions.

When we perform an experiment and obtain data, we are sampling out of the generative distribution. The true generative distribution is unknown, but by sampling out of it, we gain insights about the generative process. For example, if I measure the heights of a collection of humans, I learn something about the generative distribution just by investigating the samples out of it (the measured data).

Similarly, we can learn a lot about probability distributions, including model generative distributions, by sampling out of them directly using random number generation. In this section, we will also learn about the techniques for doing so.
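As a minimal sketch of what sampling out of a model distribution can look like, the Python snippet below draws samples from a Normal distribution using only the standard library. The choice of a Normal distribution, the parameter values (a mean of 170 cm and a standard deviation of 10 cm), the seed, and the function name draw_heights are all assumptions made purely for illustration; they are not taken from real height data or from the lessons that follow.

```python
import random
import statistics

# Hypothetical parameters, chosen only for illustration:
# a Normal model for heights with mean 170 cm and standard deviation 10 cm.
MU = 170.0
SIGMA = 10.0

# Seed the generator so the draws are reproducible.
rng = random.Random(3252)


def draw_heights(n):
    """Draw n samples out of the assumed Normal(MU, SIGMA) distribution."""
    return [rng.gauss(MU, SIGMA) for _ in range(n)]


# A small sample, like a modest experiment, and a much larger one.
small = draw_heights(10)
large = draw_heights(10_000)

# Summaries of the samples estimate properties of the distribution.
for label, sample in [("n = 10", small), ("n = 10000", large)]:
    print(label, statistics.fmean(sample), statistics.stdev(sample))
```

With more samples, the sample mean and standard deviation tend to land closer to the parameters of the distribution we drew from, which is the sense in which sampling tells us about the distribution itself.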

We will proceed informally, but will nonetheless give useful working definitions of probability and aspects thereof, with an eye toward putting them to use for modeling and interpreting data.