Excel Chukwu.

Exploring Probability and Counting

A deep understanding of the basics is necessary for advanced statistical methods.

On weekdays, I read portions of technical books I have chosen for the month/period of time. On the weekends, I synthesize and consolidate what I have read into articles and small projects (when possible) to ensure understanding. The synthesis articles also serve as a form of spaced repetition.

This month, I read Statistics Every Programmer Needs by Gary Sutton and Think Distributed Systems by Dominik Tornow.

This is the full synthesis of Chapter 2 of Statistics Every Programmer Needs.


Understanding Probability and Counting

I think that perhaps one of man’s greatest pursuits is the quest for determinism: to fit the unknown into the known, the uncertain into the certain. Our desire to predict the weather, to predict markets, to quantify and explain various observed phenomena, and to grasp randomness are all expressions of this pursuit. The study of probability, and the broader field of statistics, represents perhaps some of mankind’s greatest progress toward it.

Probability is a cornerstone of perhaps every aspect of our modern lives. Its fingerprints are found everywhere and in everything, from daily decision making to massively quantitative techniques. Understanding probability allows us to quantify uncertainty, assess risks, and make informed predictions from data.

Many examples exist to build an intuitive understanding of probability and counting, including famous ones such as rolling a pair of dice, flipping coins, or selecting cards from a deck.

Notification

For the following illustration, I will use the (perhaps) more relatable example of unlocking a phone, after which I will return to the famous examples listed above because they have fewer possible outcomes and are easier to reason about. I will occasionally reference the phone example again, though.

Most mobile devices allow you to set a 4- to 6-digit PIN to control access to your device; I will assume 6 digits for this example.

A 6-digit PIN keypad.

For a pin to work, some basic rules are established:

  1. Each digit of the pin can take any value between 0 and 9, inclusive: [0, 9].
  2. Order matters. The pin 012345 is not the same as 102345.
  3. Additionally, while not often advised, pins allow for repetition. Thus, a six-digit pin could look like 0 0 0 0 0 0.
  4. The choice or value of a digit does not inherently affect the value of another. That is, each choice is independent.

These rules form the basis of ‘Counting’ (specifically permutations), a closely related area and foundational tool used within probability theory which will be addressed in more detail later.

We could then calculate all the possible values for a 6-digit pin, knowing that each digit can be 0-9, order matters, repetition is allowed, and each choice is independent by observing that for all six positions, all ten digits from 0 - 9 are possible.

Thus, there are $10^6 = 1,000,000$ possible pins. This is an example of the multiplication rule of counting.

We could also calculate the probability of someone guessing the correct pin in one try, assuming every pin is equally likely, as $1 / 1,000,000$, and extend this to multiple distinct guesses (e.g., $3 / 1,000,000$ for three attempts).
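As a quick sanity check, the sample-space size and the guess probabilities above can be computed directly (a minimal Python sketch; the variable names are my own):

```python
# Sample space for a 6-digit PIN: 10 choices per position over 6 independent
# positions, so the multiplication rule gives 10^6 possible PINs.
total_pins = 10 ** 6
print(total_pins)  # 1000000

# Probability of guessing the PIN in one uniformly random attempt,
# and across three distinct attempts.
p_one_try = 1 / total_pins
p_three_tries = 3 / total_pins
print(p_one_try, p_three_tries)
```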

Additionally, one could ask, “How does the probability change if the device restricts repeated digits or enforces other rules?” The answer to this and the idea of sample spaces will be discussed in more detail in Chapter 3’s synthesis article.

Types of Probabilities

  • theoretical: the number of outcomes and what constitutes success are constrained and well-defined. Flipping coins, rolling dice, and selecting cards are based on a theoretical or assumed understanding of equally likely outcomes in a fixed and controlled environment.

  • empirical: derived from trials, experiments, or real-world observations. For example, if a teacher observes that, over the span of 10 tests given in a semester, a student passes 7 tests and somehow absolutely bombs the remaining 3, they could conclude that the probability of passing any given test is 7/10. The probability of success in such situations can be defined as:

    $$P(Event) = \frac{number \space of \space successes \space observed}{number \space of \space observations \space made}$$

  • subjective: based on personal judgement, opinions, and beliefs rather than on observed data or mathematical calculations.

Calculating Probabilities

From the simple calculations above, we can express probability as:

$$ Probability(Event) = \frac{total \space number\space of\space successful\space outcomes}{total\space number\space of\space possible\space outcomes} $$

or more simply:

$$ Probability(Event) = \frac{number\space of\space successes}{number\space of\space outcomes} $$

It is important to get the denominator right (it represents the sample space, or total number of possible outcomes) and to precisely define what qualifies as a success.

Odds and Probability

Colloquially, a statement like, “What are the odds…” in the context of an event occurring is quite common. Odds, however, are not quite the same as probabilities. While we have defined probability as number of successes / number of outcomes, odds are expressed as the ratio between successes and failures:

$$odds = \frac{number \space of \space potential \space successes}{number \space of \space potential \space failures}$$

or

$$odds = \frac{probability \space of \space success}{probability \space of \space failure}$$

For example, the probability of selecting a face card from a standard deck of 52 cards is 12/52 (there are 12 face cards in a standard deck) or 0.231.

The probability of NOT selecting a face card from a standard deck of 52 cards is (52 - 12) / 52 $\equiv$ 40/52 or 0.769.

Thus, odds of selecting a face card equal:

$$ \frac{number \space of \space face \space cards}{number \space of \space non \space face \space cards} $$

or:

$$ \frac{probability \space of \space face \space card}{probability \space of \space non \space face \space card} $$

Which is equivalent to:

$$ \frac{12}{40} \equiv \frac{0.231}{0.769} \equiv 0.3 $$

We can also derive probabilities from odds: i.e., we can derive the probability of selecting a face card from the odds of selecting a face card. We simply insert the odds into the following formula: $P(Event) = \frac{odds}{odds + 1}$. So that:

$$ P(face \space card) = \frac{0.3}{0.3 + 1} = \frac{0.3}{1.3} = 0.231 $$

or a 23% probability of selecting a face card from a standard deck of 52 cards.
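The face-card conversions above can be sketched in a few lines of Python; the helper names `probability_to_odds` and `odds_to_probability` are my own, not from the book:

```python
def probability_to_odds(p):
    """Odds = ratio of the probability of success to the probability of failure."""
    return p / (1 - p)

def odds_to_probability(odds):
    """P(Event) = odds / (odds + 1)."""
    return odds / (odds + 1)

p_face = 12 / 52                      # 12 face cards in a 52-card deck
odds_face = probability_to_odds(p_face)
print(round(odds_face, 3))                        # 0.3
print(round(odds_to_probability(odds_face), 3))   # 0.231
```

Converting back and forth like this confirms the two representations carry the same information.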

Counting Rules

[Arithmetic] Counting may be second nature to many, but the mathematics underlying the concept of counting in combinatorics (the branch of mathematics that deals with counting, arranging, and selecting objects in finite or discrete systems) is much more interesting than it may at first seem.

To better understand counting, I think it’s important to define some terms.

  • independent events: events whose outcomes do not depend on or influence the outcomes of other events.
  • dependent events: events whose outcomes depend on or influence the outcomes of other events.
  • mutually exclusive events: events that cannot occur at the same time.

There are two fundamental counting rules: the multiplication and addition rules.

Multiplication Rule

Formally, the multiplication rule states that

if there are $i$ choices to be made, with $n_1$ possibilities for the first choice, $n_2$ possibilities for the second choice, and so forth, then the aggregate number of potential outcomes equals the product of the individual choices, denoted as $n_1 \times n_2 \times … n_i$

In a nutshell, this means that when you have $n$ positions to select for, with a finite number of possibilities for each position, the total possible outcome is equal to the product of the number of possibilities for each position.

This rule is used when the outcomes of two or more events are sequential or simultaneous but independent of each other.

An example of where the multiplication rule applies is the number of possible outcomes of a 3-digit lock. Each digit is between 0 and 9, so the total number of possible outcomes amounts to $10 \times 10 \times 10$, or $1000$ unique outcomes.

Addition Rule

Formally, the addition rule states that

The total probability of one or another of the mutually exclusive events occurring is the sum of their individual probabilities.

It’s important to note that this rule applies specifically to events that cannot happen simultaneously.

An example where this rule applies is the probability of obtaining either a $3$ or $5$ from rolling a single fair six-sided die. You cannot simultaneously get both values, so the probability of getting either is the sum of their individual probabilities—in this case $\frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$. We could alternatively think in terms of the total number of outcomes. If given the option to either flip a coin or roll a die, the outcomes of these events are mutually exclusive because you can only do one of these actions at a time. There are $2$ possible outcomes for flipping a coin and $6$ possible outcomes for rolling the die. The total number of potential outcomes is $2 + 6 = 8$, assuming you choose to perform only one of these actions.
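The die example above can be checked with exact arithmetic; a minimal sketch using Python’s `fractions` module:

```python
from fractions import Fraction

# A fair six-sided die: each face carries probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rolling a 3 and rolling a 5 are mutually exclusive outcomes,
# so the addition rule applies: P(3 or 5) = P(3) + P(5).
p_3_or_5 = die[3] + die[5]
print(p_3_or_5)  # 1/3
```

Using `Fraction` avoids floating-point rounding and makes the $\frac{2}{6} = \frac{1}{3}$ simplification automatic.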

Understanding the correct application of these rules is crucial for accurately computing probabilities.

Combinations and Permutations

TL;DR:

  • counting when order doesn’t matter: combination
  • counting when order matters: permutation

Permutations with Replacement

Given a set of items, a permutation with replacement involves selecting items in a specific order, where after each selection the chosen item is returned to the set before the next selection is made, thereby making the chosen item eligible to be selected again.

The first illustration given in this article of unlocking a device using a PIN code is an example of a permutation with replacement because:

  • order matters
  • reusing digits is allowed

Because the digits are immediately replaced once they are used, the number of available choices remains unchanged throughout; it’s not decremented by one immediately following each selection.

Decrementing by one occurs when replacement is not allowed. Thus, each choice reduces the number of available options.


When setting the first digit of a 6-digit PIN for a device, we have ten numerals, 0 through 9, to choose from. Because each digit can be reused, we have the same ten choices when setting each of the remaining digits.

Thus, the formula to get the number of potential outcomes when the order matters and replacement is allowed is simply $n^r$. Where,

  • $n$ is the number of distinct items (events or choices)
  • $r$ is the number of selections made

Permutations without Replacement

As stated earlier, when replacement is not allowed, each choice effectively decreases the number of available options. That is, while order remains relevant, we have to nonetheless reduce the number of available choices for each successive selection$^{[1]}$.

The formula to calculate the number of potential outcomes when the order matters and replacement is not allowed is given by: $\frac{n!}{(n-r)!}$.

Sutton explains the intuition quite well:

This formula is derived from the concept that, for each [$n$] position in a sequence, you have one less option as each item is used up [represented by the factorial in the numerator]. Initially, there are $n$ choices for the first position, $n-1$ choices for the second, and so on, until you have made $r$ selections. The factorial in the denominator, $(n-r)!$, accounts for the reduction in available choices as each item is selected, reflecting the fact that replacement is not allowed.

When $n$ is $5$ and $r$ is $3$, the number of potential outcomes is equal to $\frac{5!}{(5-3)!}$ or $\frac{120}{2}$, which is $60$. So if any $3$ runners out of $5$ can qualify for a medal, there are 60 first, second, and third place permutations possible.

You may notice that we can obtain the same result by multiplying the three factors, $5 \times 4 \times 3$, representing the number of available choices for each position. Essentially, we can quickly calculate permutations without replacement by writing out this descending product rather than applying the full formula specified above.
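The podium example can be checked by brute force with Python’s `itertools.permutations` (the runner names are hypothetical placeholders):

```python
from itertools import permutations
from math import factorial

# n = 5 runners (hypothetical names), r = 3 medal positions.
runners = ["Ada", "Ben", "Che", "Dia", "Eli"]
podiums = list(permutations(runners, 3))  # ordered selections, no repeats
print(len(podiums))  # 60

# Matches the formula n! / (n - r)!
print(factorial(5) // factorial(5 - 3))  # 60
```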

Combinations without Replacement

Sutton beautifully explains how, [perhaps] the easiest way to think about combinations without replacement is to first think about permutations without replacement, then insert an adjustment to eliminate the significance of order.

When order no longer matters, the number of distinct outcomes is significantly reduced.

E.g., the sequences $123$, $132$, $213$, $231$, $312$, and $321$ are all considered different when the order matters, but when the order doesn’t matter, these six sequences are treated as identical, leaving only one relevant outcome.

To find the number of ways three items can be sequenced when the order matters, we apply the factorial function: $3!$, which yields $6$. That is to say, permutations without replacement have six times as many possible outcomes as combinations without replacement when $n$ is $5$ and $r$ is $3$.

Consequently, we can extrapolate this idea to any value of $r$ (the number of choices we have to make) and adjust the permutations without replacement formula by dividing it by the factor just calculated, $r!$, because we no longer care about order.

Thus, the number of combinations without replacement can be derived by plugging the values for $n$ and $r$ into the following equation:

$$ \frac{n!}{(n-r)!} \times \frac{1}{r!} $$

or simply:

$$ \frac{n!}{r!(n-r)!} $$

When $n=5$ and $r=3$, that comes to $120 \div 12 = 10$, showing that the number of permutations without replacement is indeed exactly six times the number of combinations without replacement.
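Python’s `math` module ships `perm` and `comb` (Python 3.8+), which makes the six-fold relationship easy to confirm:

```python
from math import comb, factorial, perm

n, r = 5, 3
print(perm(n, r))  # 60 ordered selections (permutations without replacement)
print(comb(n, r))  # 10 unordered selections (combinations without replacement)

# Dividing the permutation count by r! removes the significance of order.
print(perm(n, r) // factorial(r) == comb(n, r))  # True
```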

Combinations with Replacement

A combination with replacement involves selecting items from a set where any item can potentially be selected multiple times, and the order of selection doesn’t matter.

One of the beauties of first principles understanding is the ability to layer knowledge and consolidate understanding. It is much easier to build an understanding of combinations with replacement once combinations without replacement is understood. The formula for calculating the number of potential combinations with replacement is actually derived from the formula for counting combinations without replacement: $\frac{n!}{r!(n-r)!}$.

To understand how the formula for combinations with replacement is derived, I will use the popular ‘stars and bars’ illustration, as it actually took me some time to understand why $r - 1$ dummy items had to be introduced due to how Sutton’s ‘dummy items’ illustration was phrased.

Whenever I want to get ice cream, I usually want multiple flavors. Assume that today the $5$ available flavors are chocolate (C), vanilla (V), strawberry (S), pistachio (P), and mint (M), that I want just $3$ scoops, and that I can repeat a flavor. How many different combinations are possible?

Some examples:

  • C, C, C (all chocolate) — definitely not me
  • C, V, S (one chocolate, one vanilla, one strawberry)
  • V, V, P (two vanilla, one pistachio)

Notice that C, V, S is the same as V, S, C because order doesn’t matter.

Now imagine $4$ dividers (bars) separating the $5$ flavor sections, with each scoop I choose represented as a star. Picking $3$ scoops from the $5$ flavors would then look like:

  • ⭐️⭐️⭐️| | | | (all chocolate)
  • ⭐️ | ⭐️ | ⭐️ | | (1 chocolate, 1 vanilla, 1 strawberry, 0 pistachio, 0 mint)
  • | ⭐️⭐️| |⭐️| (0 chocolate, 2 vanilla, 0 strawberry, 1 pistachio, 0 mint)

We’re arranging $3$ stars and $4$ bars (we need $n - 1$ bars to create $n$ sections). In total, that’s arranging (counting) $7$ objects, where we choose which $3$ positions get stars (or equivalently, which $4$ get bars). This means that, in addition to our total number of flavors, $n = 5$, we are adding $r - 1$ ( $3 - 1 = 2$ ) ‘dummy items’. Alternatively, we can think of it as adding $n - 1$ ( $5 - 1 = 4$ ) bars to our $r = 3$ stars.

The above also means that we are choosing:

  • Which $r$ positions get stars, OR equivalently
  • Which $(n-1)$ positions get bars

This generalizes for any $n$ and any $r$.

So we can effectively modify the combinations without replacement formula to obtain the combinations with replacement formula like so:

$$ \frac{(n + (r - 1))!}{r!((n + (r - 1)) - r)!} $$

or simply:

$$ \frac{(n + r - 1)!}{r!(n - 1)!} $$

This formula gives us $C(n+r-1, r)$ or equivalently $C(n+r-1, n-1)$.
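The ice cream example can be verified by enumerating the selections with `itertools.combinations_with_replacement` and comparing against the stars-and-bars formula:

```python
from itertools import combinations_with_replacement
from math import comb

flavors = ["C", "V", "S", "P", "M"]  # n = 5 flavors
scoops = list(combinations_with_replacement(flavors, 3))  # r = 3 scoops
print(len(scoops))  # 35

# Stars and bars: C(n + r - 1, r) = C(7, 3)
print(comb(5 + 3 - 1, 3))  # 35
```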

Tip

As a side note, I actually truly understood this combination with replacement formula as I thought about how to explain it for this article—which was why I started the synthesis articles.

Random Variables

Similar to how programming variables store values, a random variable assigns a value to each possible outcome of a random experiment or process. There are two types of random variables: continuous and discrete.

Continuous Random Variables

A continuous random variable is one that can take any value within a certain range, [a, b], where a and b represent the inclusive lower and upper bounds respectively. Because it can assume an infinite number of values within a continuous range, the probability of it taking on any single exact value is zero; non-zero probabilities are instead assigned to intervals of values. Even then, intervals of equal width need not have equal probability.

A lot of phenomena in our observable world are continuous, but they are often easier to reason about (and measure) in discrete terms. Examples include time, length, and speed.

We describe the behavior of continuous random variables using the probability density function (PDF) and the cumulative distribution function (CDF).

Probability Density Function (PDF)

The PDF, denoted $f(x)$ or $p(x)$ where $x$ is the random variable, quantifies the likelihood of a continuous random variable falling within a specific range of values.

It also denotes the relative likelihood (i.e. the likelihood compared to other possible values) of the random variable taking on a specific value or falling within a specific interval. Essentially, it says, “of all the possible values that the random variable can take at any point in time, what is the relative likelihood that it will take on a value near $x$?”

The shape of the PDF depends on the underlying distribution of the data being modeled: different distributions have their own unique PDFs. However, regardless of the underlying distribution, the PDF curve is often a smooth line—the smoothness of which is determined by the sample size.

Notification

Probability distributions will be addressed in detail in the next synthesis article for this book.

Cumulative Distribution Function (CDF)

The CDF, denoted as $F(x)$ or $P(X \le x)$, gives the probability that the continuous random variable $X$ takes on a value less than or equal to given point $x$.

For a CDF $F(x)$ associated with the random variable $X$ and values $a$ and $b$:

  • the function $F(x)$ is non-decreasing. This means that as $x$ increases, the probability $P(X \le x)$ does not decrease. So, if $a < b$, then $F(a) \le F(b)$
  • the value of $F(x)$, as a probability, is always equal to some value between 0 and 1. That is, $0 \le F(x) \le 1$
  • we use the formula $P(a < X \le b) = F(b) - F(a)$ to find the probability that the random variable $X$ takes on a value within the interval $(a, b]$. This means that, given $F(a)$ as the probability that $X \le a$, and $F(b)$ as the probability that $X \le b$, the probability that $X$ falls in the half-open range $(a, b]$ is the difference $F(b) - F(a)$.
  • To compute the value of $F(a)$, assuming we know the PDF, we find the area under the PDF from the lower bound of the variable’s range (which may be $-\infty$) up to $a$.

It is important to note that the shape of the CDF is connected to that of the PDF. So, as the PDF changes according to the distribution being modeled, the CDF adapts accordingly. For a normal distribution, the CDF takes on an ‘S’ shape.
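As a concrete illustration of the properties above, the CDF of a normal distribution can be computed exactly from the error function in Python’s `math` module; the `normal_cdf` helper name is my own, and the example evaluates $F(b) - F(a)$ for the standard normal:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(a < X <= b) = F(b) - F(a); for the standard normal, about 68.27%
# of the probability lies within one standard deviation of the mean.
a, b = -1.0, 1.0
print(round(normal_cdf(b) - normal_cdf(a), 4))  # 0.6827
```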

Discrete Random Variables

Discrete random variables represent outcomes with distinct, countable values. For example, when rolling a six-sided die, the only possible outcomes are $1, 2, 3, 4, 5, 6$. Other fractional outcomes such as $3.3$ are impossible.

Some differences between continuous and discrete random variables are:

  • discrete random variables have a probability mass function (PMF), as opposed to the PDF associated with continuous variables.
    • while the PDF describes relative likelihoods within a continuous range, the PMF provides the probability of each individual outcome for the variable—i.e., the actual probability of getting each specific outcome.
    • a simple example of a PMF comes from the discrete uniform distribution, where each possible outcome within a finite range is assigned equal probability.
  • the CDF for discrete random variables is a step function.

Probability Mass Function (PMF)

As mentioned earlier, the PMF, denoted as $P(X = x)$, gives the specific probability of each individual outcome for the variable.

The specific PMF formula depends on the probability distribution being modeled—i.e., different probability distributions have their own unique PMFs.

The PMF must mathematically satisfy two properties:

  • non-negativity: all probabilities are always greater than or equal to zero, as the PMF assigns non-negative probabilities to each outcome.
  • summation: the sum of all probabilities for all possible outcomes of the random variable is equal to $1$. That is, $\sum P(X = x) = 1$.

You will probably notice that some of the illustrations above are best modeled using the PMF.
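As a quick illustration, the two PMF properties can be checked for a fair six-sided die (a minimal sketch using exact fractions):

```python
from fractions import Fraction

# PMF of a fair six-sided die: a discrete uniform distribution.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Non-negativity: the PMF assigns a non-negative probability to each outcome.
assert all(p >= 0 for p in pmf.values())

# Summation: the probabilities over all possible outcomes total exactly 1.
assert sum(pmf.values()) == 1
print("PMF properties hold")
```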

Cumulative Distribution Function (CDF)

The CDF of a discrete random variable is expressed similarly to that of a continuous random variable: $F(x) = P(X \le x)$. That is, it gives the probability that $X$ is less than or equal to a certain value $x$.

It assumes the following properties:

  • as described earlier, it is non-decreasing
  • bounded: CDF is bounded between $0$ and $1$, inclusive.
  • right-continuous: the cumulative probability jumps at each possible value of $x$, reflecting the total probability up to and including that point, and stays flat between possible values of $x$. This means that, for example, $F(3) \ge F(2)$, and $F(3)$ represents the cumulative probability up to and including $3$; there is no increase when $X$ is equal to, say, $2.5$. Recall that the CDF for discrete random variables is a step function.

The cumulative distribution function is calculated by simply summing the probabilities of all possible values less than or equal to a specific value of $x$. That is, $F(x) = P(X \le x) = \sum P(X = k)$, where the sum is taken over all possible values of $k$ that are less than or equal to $x$.

As an example, let us consider the cumulative probabilities of rolling a pair of six-sided dice. The probability of getting $4$ or less is equal to the probability of getting $2$ plus the probability of getting $3$ plus the probability of getting $4$: $F(4) = P(X = 2) + P(X = 3) + P(X = 4) \approx 0.0278 + 0.0556 + 0.0833 \approx 0.167$.
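The same cumulative probability can be obtained by enumerating all $36$ outcomes of the two dice; a short sketch:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two fair six-sided dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
pmf = {s: Fraction(sums.count(s), 36) for s in set(sums)}

# F(4) = P(X <= 4) = P(X = 2) + P(X = 3) + P(X = 4)
F_4 = sum(p for s, p in pmf.items() if s <= 4)
print(F_4)                   # 1/6
print(round(float(F_4), 3))  # 0.167
```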

Summary

You have come to the end of what I hope has provided a comprehensive introduction to statistics and a foundation for advanced study. Writing this has certainly been a way to refresh and reinforce my knowledge.

  • This chapter explored the fundamental building blocks of probability and statistics—concepts that underpin virtually all statistical reasoning and decision-making in the modern world.
  • We examined how probability quantifies uncertainty through three distinct lenses: theoretical probability (based on well-defined, equally likely outcomes), empirical probability (derived from real-world observations and experiments), and subjective probability (grounded in personal judgement and belief). We explained the distinction between probability and odds, and how to convert between them—which provided additional flexibility in expressing likelihood.
  • The counting rules—multiplication and addition—serve as essential tools for determining sample spaces and calculating probabilities. The multiplication rule applies to independent and sequential events, while the addition rule handles mutually exclusive events.
  • Building on these foundations, we explored permutations and combinations, distinguishing between scenarios where order matters versus where it doesn’t, and whether replacement is allowed. The “stars and bars” method illuminated how combinations with replacement work, demonstrating how first principles thinking allows us to effectively layer knowledge.
  • We distinguished between discrete and continuous random variables, introducing the probability mass function (PMF) for discrete outcomes and the probability density function (PDF) for continuous ranges. The cumulative distribution function (CDF) applies to both types of variables, albeit with some distinct properties.

Mastering these fundamentals builds the essential groundwork for everything that follows in statistics, and the next synthesis article, which covers probability distributions and conditional probabilities, will complete the theoretical foundation needed for more advanced and practical statistical work.