Introductory Probability for Machine Learning - Part 1

Aditya Raj
13 min read · Sep 6, 2021

Every learnable, data-based mathematical model has a probabilistic interpretation or intuition behind it, and almost all machine learning and deep learning models share this property. A good knowledge of probability lets you understand nearly every machine learning concept. Even recent ML advances such as GANs and auto-encoders were developed with strong probabilistic intuitions. Since it is essential to understand the probabilistic interpretation of all standard ML models, we definitely need a good grasp of probability.

Probability is a very deep (for me, infinite) field, so I will discuss only what is required for this machine learning course. I will split probability into three parts: this is the first, the second will be released within two days, and the third, on some advanced probability concepts, will come after we cover some basic machine learning models.

Let's set some terms right first:

Sample space (S) :- the set of all possible outcomes of an experiment. For example: flipping a coin gives S1 = {H, T}; rolling a die gives S2 = {1, 2, 3, 4, 5, 6}.

Event (E) :- any subset E of the sample space S. For example, E1 = {H} of S1 (only heads), or E2 = {2, 4, 6} (only even die outcomes).

E∪F :- the new event E∪F consists of all outcomes of S that are in E, in F, or in both. E.g. E = {1, 2}, F = {3, 4}: E∪F = {1, 2, 3, 4}.

E∩F :- the new event EF consists of all outcomes of S that are in both E and F. E.g. E = {1, 2}, F = {2, 3}: EF = {2}. If EF = Φ, then E and F are called mutually exclusive events.

Unions :- if E1, E2, …, En are events, their union, Union(E1, …, En), is the event consisting of all outcomes that are in Ei for at least one value of i ∈ {1, 2, …, n}.

Intersections :- if E1, E2, …, En are events, their intersection, Intersection(E1, …, En), is the event consisting of all outcomes that are in Ei for every value of i ∈ {1, 2, …, n}.

E^c (complement of E) :- the event consisting of all outcomes of S that are not in E.

E∪E^c = S, E∩E^c = Φ, S^c = Φ
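These event operations map directly onto Python's built-in sets, so here is a quick sanity check of the definitions above (illustrative only):

```python
# Event operations on a one-die sample space, using Python sets.
S = {1, 2, 3, 4, 5, 6}
E = {1, 2}
F = {2, 3}

union = E | F          # E U F = {1, 2, 3}
intersection = E & F   # E ∩ F = {2}
complement_E = S - E   # E^c = {3, 4, 5, 6}

# E U E^c recovers the whole sample space, and E ∩ E^c is empty:
assert E | complement_E == S
assert E & complement_E == set()
```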

For an event E in S, P(E) denotes the probability that event E occurs, and it satisfies the following axioms:

  1. 0 ≤ P(E) ≤ 1
  2. P(S) = 1
  3. for mutually exclusive events E1, E2, …: P(union of all Ei) = sum of P(Ei), and the probability of their intersection is 0

P(S) = 1 => P(E∪E^c) = 1 => P(E) + P(E^c) = 1 => P(E^c) = 1 − P(E)

If E and F are not mutually exclusive: P(E) + P(F) = P(E∪F) + P(EF)

1. P(E∪F) = P(E) + P(F) − P(EF)

2. P(E ∪ F ∪ G)
= P(E ∪ F) + P(G) − P((E ∪ F)G)
= P(E) + P(F) − P(EF) + P(G) − P(EG ∪ FG)
= P(E) + P(F) − P(EF) + P(G) − P(EG) − P(FG) + P(EG ∩ FG)
= P(E) + P(F) + P(G) − P(EF) − P(EG) − P(FG) + P(EFG)
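We can check the three-event formula by brute-force enumeration on a fair die; the three events below are chosen arbitrarily for illustration:

```python
from fractions import Fraction

# Verify P(E U F U G) = P(E)+P(F)+P(G) - P(EF) - P(EG) - P(FG) + P(EFG)
# on a fair die, where P(A) = |A| / 6.
S = {1, 2, 3, 4, 5, 6}
E, F, G = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
P = lambda A: Fraction(len(A), len(S))

lhs = P(E | F | G)
rhs = P(E) + P(F) + P(G) - P(E & F) - P(E & G) - P(F & G) + P(E & F & G)
assert lhs == rhs  # both sides come out to 5/6 here
```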

Finally, from the above patterns we can obtain the general inclusion-exclusion formula:

P(E1 ∪ E2 ∪ … ∪ En) = Σ P(Ei) − Σ P(EiEj) + Σ P(EiEjEk) − … + (−1)^(n+1)·P(E1E2…En)

where the second sum is over all pairs i < j, the third over all triples i < j < k, and so on.

Conditional Probability

Let E and F be two events. The conditional probability that E occurs, given that F has already occurred, is denoted by P(E|F). Let's understand it with an example.

We calculate P(E|F) = P(EF)/P(F), i.e. the conditional probability of E given that F has occurred is the probability of the intersection of E and F (E∩F) relative to the probability of F. Equivalently, P(E∩F) = P(E|F)·P(F).

Let's work through an example: find the probability that the sum of two rolled dice is 6, given that the first die shows 4.

Given that the first die shows 4, the reduced sample space is ({4,1},{4,2},{4,3},{4,4},{4,5},{4,6}), and the favourable outcome is {4,2}, so the answer is 1/6 directly.

Solving the same problem with conditional probability, P(E|F) = P(EF)/P(F):

P(EF) = probability that the sum is 6 and the first die is 4 = P({4,2}) = 1/36

P(F) = probability that the first die is 4 = P({4,1},{4,2},…,{4,6}) = 6/36

P(E|F) = (1/36)/(6/36) = 1/6

P(E|F) = P(E∩F)/P(F) => P(E∩F) = P(E|F)·P(F)
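The dice example can be checked by enumerating all 36 outcomes (a small sketch, no libraries beyond the standard library):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 outcomes of two dice to check P(sum = 6 | first die = 4).
outcomes = list(product(range(1, 7), repeat=2))
F = [(a, b) for (a, b) in outcomes if a == 4]   # first die is 4
EF = [(a, b) for (a, b) in F if a + b == 6]     # ...and the sum is 6

P_F = Fraction(len(F), len(outcomes))    # 6/36
P_EF = Fraction(len(EF), len(outcomes))  # 1/36
assert P_EF / P_F == Fraction(1, 6)      # P(E|F) = P(EF)/P(F) = 1/6
```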

Independent Events :- Two events E and F are said to be independent if P(E∩F) = P(E)·P(F), i.e. P(E|F) = P(E); otherwise they are called dependent events.

The events E1, E2, …, En are called (mutually) independent if for every subset Ei1, Ei2, …, Eir, r ≤ n, of these events

P(Ei1 Ei2 … Eir) = P(Ei1)·P(Ei2)·…·P(Eir)

Note that pairwise independent events need not be mutually independent.
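The standard counterexample uses two fair coin flips; a quick Python sketch:

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips. A = first flip is heads, B = second flip is heads,
# C = the two flips differ. Every pair is independent, but the three
# events together are not mutually independent.
S = list(product("HT", repeat=2))
P = lambda ev: Fraction(sum(ev(o) for o in S), len(S))

A = lambda o: o[0] == "H"
B = lambda o: o[1] == "H"
C = lambda o: o[0] != o[1]

both = lambda e1, e2: (lambda o: e1(o) and e2(o))
assert P(both(A, B)) == P(A) * P(B)  # pairwise independent
assert P(both(A, C)) == P(A) * P(C)
assert P(both(B, C)) == P(B) * P(C)

allthree = lambda o: A(o) and B(o) and C(o)
assert P(allthree) != P(A) * P(B) * P(C)  # 0 on the left, 1/8 on the right
```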

Chain rule

The conditional probability rule defined above for two events, P(E∩F) = P(E|F)·P(F), can be extended to more than two events. For example, for four events A1, A2, A3, A4 it can be written as

P(A1A2A3A4) = P(A1)·P(A2|A1)·P(A3|A1A2)·P(A4|A1A2A3)

For n events it can be generalized to:

P(A1A2…An) = P(A1)·P(A2|A1)·P(A3|A1A2)·…·P(An|A1A2…A(n−1))

Random Variable

A real-valued function defined on the sample space is called a random variable.

Example :- Let X be the random variable defined as the sum of two fair dice rolled together. Then P(X=2) = P({1,1}) = 1/36, and P(X=5) = P({1,4},{2,3},{3,2},{4,1}) = 4/36 = 1/9.

Example :- Suppose we keep tossing a coin whose probability of coming up heads is p, and define the random variable N as the number of flips needed to get the first head:

P(N=0) = not applicable (at least one toss is made)

P(N=1) = p (head on the first flip)

P(N=2) = (1−p)·p (head after 1 tail)

P(N=3) = (1−p)²·p (head after 2 tails)

P(N=n) = (1−p)^(n−1)·p (head after n−1 tails)
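As a sanity check, these probabilities should sum to 1 over n = 1, 2, 3, …; here is a quick numeric check with an arbitrary p, truncating the infinite sum:

```python
# pmf P(N = n) = (1 - p)**(n - 1) * p for the first-head flip count.
p = 0.3
pmf = lambda n: (1 - p) ** (n - 1) * p

# Truncated sum over n = 1..199 is essentially 1 (the tail is negligible).
total = sum(pmf(n) for n in range(1, 200))
assert abs(total - 1.0) < 1e-12
assert pmf(1) == p                              # head on the first flip
assert abs(pmf(3) - (1 - p) ** 2 * p) < 1e-15   # head after two tails
```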

Discrete Random Variable :- A random variable defined on a finite or countable (i.e. discrete) set of possible values from the sample space is called a discrete random variable. The examples given above are discrete r.v.s.

Continuous Random Variable :- A random variable defined on a continuum of possible values, i.e. a range containing infinitely many values between any two, is called a continuous random variable. Example: a random variable X denoting the lifetime of a car, with quantities like P(X ≥ 3) or P(2 ≤ X ≤ 10).

Cumulative Distribution Function :- The distribution function F of a random variable X is defined for any real number b, −∞ < b < ∞, by F(b) = P(X ≤ b).

In words, F(b) denotes the probability that the random variable X takes on a value less than or equal to b. In the discrete case it can also be understood as the sum of all probabilities P(X = x) for x ≤ b.

The CDF can answer many probability questions about X, for example:

P(a < X ≤ b) = F(b) − F(a).

Some properties a CDF follows:

  1. F(a) → 1 as a → ∞
  2. F(a) → 0 as a → −∞
  3. F is nondecreasing: if b > a, then F(b) ≥ F(a)

These rules are intuitive: F(a) is the probability that the random variable's value is at most a, so as a goes to infinity the variable's entire range is covered and F(a) = 1; similarly, as a goes to −∞ no values are covered and F(a) = 0. For the third, if b > a then the event {X ≤ a} is contained in {X ≤ b}, so F(b) ≥ F(a).

Discrete Random Variable

As discussed, a discrete random variable is a random variable defined on a countable/discrete set of values.

For a discrete random variable X, the probability mass function (pmf) p(a) gives the probability that the random variable X equals a, i.e. pmf(a) = P(X = a).

The cumulative distribution function F of a discrete random variable can be expressed as a sum of its probability mass function:

F(a) = Σ p(x), summed over all x ≤ a
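The running-sum relationship between pmf and CDF is easy to see in code; a minimal sketch for a fair die:

```python
from fractions import Fraction

# For a discrete random variable, F(b) is the running sum of the pmf.
# Fair die: pmf p(x) = 1/6 for x in 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
F = lambda b: sum(p for x, p in pmf.items() if x <= b)

assert F(3) == Fraction(1, 2)  # P(X <= 3)
assert F(6) == 1               # the whole sample space
assert F(0) == 0               # nothing below the smallest value
```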

The Bernoulli Random Variable

Suppose that a trial, or an experiment, whose outcome can be classified as either a “success” or as a “failure” is performed. If we let X equal 1 if the outcome is a success and 0 if it is a failure, then the probability mass function of X is given by

p(0) = P(X = 0) = 1 − p

p(1) = P(X = 1) = p

A random variable with this probability mass function is called a Bernoulli random variable.

The Binomial Random Variable

Suppose that n independent trials, each of which results in a “success” with probability p and in a “failure” with probability 1 − p, are to be performed. If X represents the number of successes that occur in the n trials, then X is said to be a binomial random variable with parameters (n, p). Its probability mass function is given by

P(X = i) = C(n, i)·p^i·(1 − p)^(n−i), i = 0, 1, …, n

where C(n, i) = n!/(i!(n − i)!) is the number of ways to choose i successes out of n trials.

The Geometric Random Variable

Suppose that independent trials, each having probability p of being a success, are performed until a success occurs. If we let X be the number of trials required until the first success, then X is said to be a geometric random variable with parameter p. Its probability mass function is given by

P(X = n) = (1 − p)^(n−1)·p, n = 1, 2, …

The Poisson Random Variable

A random variable X, taking on one of the values 0, 1, 2, …, is said to be a Poisson random variable with parameter λ if, for some λ > 0,

P(X = i) = e^(−λ)·λ^i/i!, i = 0, 1, 2, …
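The three pmfs above are a few lines each in plain Python (standard library only); each should sum to (about) 1 over its support:

```python
import math

def binomial_pmf(i, n, p):
    # P(X = i) = C(n, i) * p**i * (1-p)**(n-i)
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

def geometric_pmf(n, p):
    # P(X = n) = (1-p)**(n-1) * p, for n = 1, 2, ...
    return (1 - p) ** (n - 1) * p

def poisson_pmf(i, lam):
    # P(X = i) = exp(-lam) * lam**i / i!
    return math.exp(-lam) * lam**i / math.factorial(i)

# Sanity checks: probabilities over the support sum to 1 (up to truncation).
assert abs(sum(binomial_pmf(i, 10, 0.4) for i in range(11)) - 1) < 1e-12
assert abs(sum(geometric_pmf(n, 0.4) for n in range(1, 200)) - 1) < 1e-12
assert abs(sum(poisson_pmf(i, 3.0) for i in range(100)) - 1) < 1e-12
```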

Continuous Random variable

Let X be such a random variable. We say that X is a continuous random variable if there exists a nonnegative function f(x), defined for all real x ∈ (−∞, ∞), having the property that for any set B of real numbers

P(X ∈ B) = ∫B f(x) dx

Here f(x) is called the probability density function of the random variable X. It is the continuous analogue of the probability mass function in the discrete case.

The probability density function (pdf) can be defined as a function of a continuous random variable whose integral across an interval gives the probability that the value of the variable lies within that interval.

The integral of the pdf from −∞ to a value b gives the probability that the random variable is ≤ b, which is exactly the cumulative distribution function (cdf):

F(b) = P(X ≤ b) = ∫ from −∞ to b of f(x) dx

The probability that X lies in a range is the integral of the pdf over that range:

P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx

Setting a = b in this equation shows that the probability that a continuous random variable assumes any particular value is 0: P(X = a) = ∫ from a to a of f(x) dx = 0.

The Uniform Random Variable

X is a uniform random variable on the interval (α, β) if its probability density function is given by

f(x) = 1/(β − α) for α < x < β, and f(x) = 0 otherwise.

In particular, a random variable is said to be uniformly distributed over the interval (0, 1) if its probability density function is

f(x) = 1 for 0 < x < 1, and f(x) = 0 otherwise.

Exponential Random Variables

A continuous random variable whose probability density function is given, for some λ > 0, by

f(x) = λe^(−λx) for x ≥ 0 (and f(x) = 0 for x < 0)

is said to be an exponential random variable with parameter λ.

The cdf of an exponential random variable can be calculated as

F(b) = ∫ from 0 to b of λe^(−λx) dx = 1 − e^(−λb), b ≥ 0
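We can verify this closed-form cdf by integrating the pdf numerically; a small sketch using a midpoint rule (λ and b are arbitrary illustration values):

```python
import math

# Exponential pdf f(x) = lam * exp(-lam * x), and closed-form cdf
# F(b) = 1 - exp(-lam * b). We check them against each other numerically.
lam = 2.0
pdf = lambda x: lam * math.exp(-lam * x)

def integrate(f, a, b, steps=100_000):
    # Simple midpoint rule; accurate enough for this smooth integrand.
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

b = 1.5
numeric = integrate(pdf, 0.0, b)
closed_form = 1 - math.exp(-lam * b)
assert abs(numeric - closed_form) < 1e-6
```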

Gamma Random Variables

A continuous random variable whose density is given by

f(x) = λe^(−λx)·(λx)^(α−1)/Γ(α) for x ≥ 0 (and f(x) = 0 for x < 0)

for some λ > 0, α > 0 is said to be a gamma random variable with parameters α, λ.

The quantity Γ(α) is called the gamma function and is defined by

Γ(α) = ∫ from 0 to ∞ of e^(−x)·x^(α−1) dx

Normal Random Variables

X is a normal random variable (or simply, X is normally distributed) with parameters μ and σ² if the density of X is given by

f(x) = (1/(√(2π)·σ))·e^(−(x−μ)²/(2σ²)), −∞ < x < ∞

This density function is a bell-shaped curve that is symmetric around μ

An important fact about normal random variables is that if X is normally distributed with parameters μ and σ², then Y = αX + β is normally distributed with parameters αμ + β and α²σ². This can be proved by writing the cdf of Y (assuming α > 0) as

FY(a) = P{αX + β ≤ a} = P{X ≤ (a − β)/α} = ∫ from −∞ to (a−β)/α of (1/(√(2π)·σ))·e^(−(x−μ)²/(2σ²)) dx

where the change of variables v = αx + β turns the integral into the cdf of a normal distribution with parameters αμ + β and α²σ².

Expectation of Random Variable

The expectation of a random variable X is a weighted average of the possible values that X can take on, each value being weighted by the probability that X assumes that value.

Discrete case :- If X is a discrete random variable with probability mass function p(x), then the expected value of X is defined by

E[X] = Σ x·p(x), summed over all x with p(x) > 0

Example: find E[X], where X is the outcome when we roll a fair die.

Since p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6, we obtain E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 7/2.
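The same weighted average in a couple of lines of Python, using exact fractions:

```python
from fractions import Fraction

# E[X] for a fair die: weighted average of 1..6, each with weight 1/6.
E_X = sum(x * Fraction(1, 6) for x in range(1, 7))
assert E_X == Fraction(7, 2)
```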

Expectation of a Bernoulli Random Variable:- E[X] = 0(1 − p) + 1(p) = p

Expectation of a Binomial Random Variable:-

E[X] = Σ i·C(n, i)·p^i·(1 − p)^(n−i) = np

Similarly, the expectations of the remaining discrete random variables can be derived; for example, E[X] = 1/p for a geometric random variable and E[X] = λ for a Poisson random variable.

Continuous case :- If X is a continuous random variable with probability density function f(x), then the expected value of X is defined by

E[X] = ∫ from −∞ to ∞ of x·f(x) dx

Expectation of a Uniform Random Variable on (α, β):-

E[X] = ∫ from α to β of x/(β − α) dx = (α + β)/2

Expectation of a Normal Random Variable:-

E[X] = μ, since the density is symmetric around μ.

Expectation of function of random variable

We now know how to calculate the expectation of a random variable, but we also need to study the expectation of a function of a random variable. Again, it depends on whether X is a discrete or a continuous random variable.

  1. If X is a discrete random variable with probability mass function p(x), then for any real-valued function g,

E[g(X)] = Σ g(x)·p(x)

Example :- Suppose X has the following probability mass function: p(0) = 0.2, p(1) = 0.5, p(2) = 0.3. Calculate E[X²].

E[X²] = 0²(0.2) + 1²(0.5) + 2²(0.3) = 1.7
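This rule is one line of Python for any function g; here is a small sketch using the same pmf as the worked example:

```python
# E[g(X)] for a discrete pmf: sum of g(x) * p(x) over the support.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
E = lambda g: sum(g(x) * p for x, p in pmf.items())

assert abs(E(lambda x: x**2) - 1.7) < 1e-12  # matches the worked example
assert abs(E(lambda x: x) - 1.1) < 1e-12     # the plain mean E[X]
```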

2. If X is a continuous random variable with probability density function f(x), then for any real-valued function g,

E[g(X)] = ∫ from −∞ to ∞ of g(x)·f(x) dx

Example :- Let X be uniformly distributed over (0, 1). Calculate E[X³].

Since f(x) = 1 for 0 < x < 1,

E[X³] = ∫ from 0 to 1 of x³·1 dx = [x⁴/4] from 0 to 1 = 1/4

The expected value of a random variable X, E[X], is also referred to as the mean or the first moment of X. The quantity E[X^n], n ≥ 1, is called the nth moment of X.

Variance :- The variance of X measures the expected squared deviation of X from its expected value. It is given by:

Var(X) = E[(X − E[X])²]

Expanding the square, this is equivalent to Var(X) = E[X²] − (E[X])².

We can use the above formula to calculate the variance of any distribution.
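For instance, both forms of the formula give the same variance for a fair die; a quick check with exact fractions:

```python
from fractions import Fraction

# Variance of a fair die: Var(X) = E[(X - E[X])**2] = E[X**2] - E[X]**2.
p = Fraction(1, 6)
E_X = sum(x * p for x in range(1, 7))       # 7/2
E_X2 = sum(x * x * p for x in range(1, 7))  # 91/6

var_direct = sum((x - E_X) ** 2 * p for x in range(1, 7))
assert var_direct == E_X2 - E_X**2 == Fraction(35, 12)
```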

Jointly Distributed Random Variables

We have studied concepts involving a single random variable X; however, in real life a distribution may depend on multiple random variables, so we develop the concept of jointly distributed random variables.

For any two random variables X and Y, the joint cumulative probability distribution function of X and Y is given by,

F(a, b) = P{X ≤ a, Y ≤ b}, −∞ < a, b < ∞

The distribution of X can be obtained from the joint distribution of X and Y as follows:

FX(a) = P{X ≤ a} = P{X ≤ a, Y < ∞} = F(a,∞)

FY(b) = P{Y ≤ b} = F(∞, b)

In the case where X and Y are both discrete random variables, it is convenient to define the joint probability mass function of X and Y by

p(x, y) = P{X = x, Y = y}

The probability mass function of X may be obtained from p(x, y) by

pX(x) = Σy p(x, y), and similarly pY(y) = Σx p(x, y)

We say that X and Y are jointly continuous if there exists a function f(x, y), defined for all real x and y, having the property that for all sets A and B of real numbers

P{X ∈ A, Y ∈ B} = ∫B ∫A f(x, y) dx dy

Expectation of jointly distributed random variables :- For a function g of two random variables X and Y,

E[g(X, Y)] = Σy Σx g(x, y)·p(x, y) in the discrete case, or E[g(X, Y)] = ∫∫ g(x, y)·f(x, y) dx dy in the continuous case.

For example, if g(X, Y) = X + Y, then, in the continuous case, E[X + Y] = E[X] + E[Y]. This extends to the general linearity of expectation:

E[a1X1 + a2X2 +···+ anXn] = a1E[X1] + a2E[X2]+···+ anE[Xn]

Independent Random Variables

The random variables X and Y are said to be independent if, for all a, b,

P{X ≤ a, Y ≤ b} = P{X ≤ a}P{Y ≤ b}

In terms of the joint distribution function F of X and Y, we have that X and Y are independent if

F(a, b) = FX(a)FY(b) for all a, b

When X and Y are discrete, the condition of independence reduces to

p(x, y) = pX(x)pY(y)

while if X and Y are jointly continuous, independence reduces to

f(x, y) = fX(x)fY(y)

i.e. the joint pdf or pmf of (X, Y) is the product of their respective pdfs or pmfs.

If X and Y are independent, then for any functions h and g

E[g(X)h(Y)] = E[g(X)]E[h(Y)]

Covariance

The covariance of any two random variables X and Y, denoted by Cov(X, Y), is defined by

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

Properties of Covariance

For any random variables X, Y, Z and constant c,

  1. Cov(X, X) = Var(X)
  2. Cov(X, Y) = Cov(Y, X)
  3. Cov(cX, Y) = c Cov(X, Y)
  4. Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z).
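To make the definition concrete, here is a small sketch computing a covariance from a joint pmf (the numbers are made up for illustration):

```python
# Covariance from a small joint pmf on {0,1} x {0,1}.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# E[g(X, Y)] = sum of g(x, y) * p(x, y) over the support.
E = lambda g: sum(g(x, y) * p for (x, y), p in pmf.items())

# Cov(X, Y) = E[XY] - E[X] E[Y]
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
assert abs(cov - 0.15) < 1e-12  # X and Y are positively correlated here

# Property 1: Cov(X, X) = Var(X).
var_x = E(lambda x, y: x * x) - E(lambda x, y: x) ** 2
assert abs(var_x - 0.25) < 1e-12
```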

More important topics will be covered in part 2.

Thanks for reading! Kindly react and share, and follow me on Twitter:

https://twitter.com/AdityaR71244890?s=09
