Introductory Probability for Machine Learning - Part 2

Aditya Raj
7 min read · Sep 13, 2021

To be sure, this is not the end of probability for machine learning; these are just the introductory concepts needed for a probabilistic perspective on machine learning. We will learn more probability, and much more statistics, along with machine learning concepts and algorithms.

Bayes' Theorem

We all now know about conditional probability:

P(A ∩ B) = P(A|B) * P(B)

Since P(A ∩ B) = P(B ∩ A), the same intersection can be written both ways:

P(A ∩ B) = P(A)P(B|A) and P(A ∩ B) = P(B)P(A|B)

so P(B)P(A|B) = P(A)P(B|A)

which gives P(A|B) = P(A) * P(B|A) / P(B)

This is what we call Bayes' theorem.
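As a quick sanity check, here is a minimal Python sketch of the formula above, using made-up numbers for a disease-testing scenario (the 1% prevalence and the test accuracy figures are purely illustrative assumptions):

```python
# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
# Illustrative numbers only: A = "has disease", B = "test is positive".
p_A = 0.01             # prior P(A): assumed 1% prevalence
p_B_given_A = 0.95     # assumed true-positive rate P(B|A)
p_B_given_Ac = 0.05    # assumed false-positive rate P(B|A^c)

# P(B) via the law of total probability (derived in the next section)
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

p_A_given_B = p_A * p_B_given_A / p_B
print(f"P(A|B) = {p_A_given_B:.4f}")  # ~0.161: a positive test is far from certain
```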

Now we can make Bayes' theorem more applicable and general purpose by expanding P(B), the denominator.

Let A and B be events. We may express B as: B = (B ∩ A) ∪ (B ∩ A^c)

P(B) = P(B ∩ A) + P(B ∩ A^c), so P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)

So Bayes' theorem can now be rewritten as

P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c))

Now suppose B depends on more than two events, i.e. not just A and A^c but n mutually exclusive and exhaustive events A1, A2, …, An. In this case the formula generalizes further.

This is the general purpose Bayes formula:

P(Ai|B) = P(B|Ai)P(Ai) / (P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An))
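A small sketch of the generalized formula with three hypothetical, mutually exclusive events A1, A2, A3 (the priors and conditional probabilities below are made up for illustration):

```python
# Generalized Bayes: P(Ai|B) = P(B|Ai)P(Ai) / sum_j P(B|Aj)P(Aj)
# Illustrative numbers: three factories A1..A3 produce an item, B = "item is defective".
priors = [0.5, 0.3, 0.2]           # P(Aj), assumed
conditionals = [0.01, 0.02, 0.05]  # P(B|Aj), assumed

evidence = sum(p * c for p, c in zip(priors, conditionals))  # P(B)
posteriors = [p * c / evidence for p, c in zip(priors, conditionals)]

for i, post in enumerate(posteriors, start=1):
    print(f"P(A{i}|B) = {post:.3f}")
# The posteriors sum to 1, as they should.
```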

Likelihood

The distinction between probability and likelihood is fundamentally important: Probability attaches to possible results; likelihood attaches to hypotheses.

Well, we are not philosophical mathematicians; we tend to understand things better with simple examples. So let's understand this with the help of an example.

Let there be a random variable X whose pdf is the normal (Gaussian) function, with parameters θ = {μ, σ}. Given θ, the pdf of X can be written as P(X|θ) = N(μ, σ).

So the probability function (pdf in the continuous case, pmf in the discrete case) is represented by P(X|θ): it gives the probability distribution of the data given the hypothesis, or parameters (call it what you like). Similarly, the likelihood function is represented by L(θ|X): it scores parameters, or hypotheses, given the observed data.

The two are different things conceptually: one describes possible outcomes, the other scores hypotheses. But numerically they are the same quantity, so in our field of study we can move between the two notations freely:

L(θ|X) = f(X|θ)

where f is the pdf or pmf of the distribution.
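A tiny sketch of the two perspectives, assuming scipy is available: fixing θ and varying x gives a probability density, while fixing the observed x and varying θ gives a likelihood.

```python
from scipy.stats import norm

# Probability view: theta = (mu, sigma) fixed, outcome x varies.
mu, sigma = 0.0, 1.0
print(norm.pdf(0.5, loc=mu, scale=sigma))  # density of x = 0.5 under N(0, 1)

# Likelihood view: observed x fixed, theta varies.
x_observed = 0.5
for mu_candidate in [-1.0, 0.0, 0.5, 1.0]:
    L = norm.pdf(x_observed, loc=mu_candidate, scale=sigma)  # L(mu|x) = f(x|mu)
    print(f"L(mu={mu_candidate:+.1f} | x=0.5) = {L:.4f}")
# The likelihood is largest at mu = 0.5, the candidate closest to the data.
```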

Let's restate Bayes' theorem with this notation.

Bayes' theorem: P(β|y) = P(y|β) * P(β) / P(y)

Rewritten with the likelihood: P(β|y) = L(β|y) * P(β) / P(y), since the term P(y|β) is exactly the likelihood L(β|y).

We are trying to write a distribution describing the probability of β given y. Here y is given as evidence, and we assume we have prior knowledge about β, so that we can combine it with the evidence. We like to call P(β|y) the posterior.

Now let's write the formula in English:

posterior = likelihood * prior / evidence. This is another important reading of Bayes' theorem: it lets us update the probability of an event from its prior knowledge, the evidence, and the likelihood.
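Here is a minimal sketch of "posterior = likelihood * prior / evidence" on a coin-bias problem; the grid of candidate bias values and the observation of 7 heads in 10 flips are assumptions made up for this example:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical setup: estimate a coin's heads-probability beta
# after observing y = 7 heads in 10 flips.
betas = np.linspace(0.01, 0.99, 99)        # candidate values of beta
prior = np.ones_like(betas) / len(betas)   # flat prior P(beta), assumed

likelihood = binom.pmf(7, 10, betas)       # L(beta|y) = P(y|beta)
evidence = np.sum(likelihood * prior)      # P(y), summed over all candidates

posterior = likelihood * prior / evidence  # posterior = likelihood * prior / evidence
print(f"Posterior peaks at beta = {betas[np.argmax(posterior)]:.2f}")  # ~0.70
```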

If interested, kindly give a few minutes to https://youtu.be/HZGCoVF3YvM for a more in-depth understanding.

Maximum Likelihood Estimation

We now know about the likelihood function and how it can be expressed in terms of a probability distribution.

For a given set of numbers, or data, we want to find the best-suited distribution in the form of its parameters. For example: what should the values of μ and σ be to express the set of numbers as a normal distribution?

So, to estimate the parameters of the distribution best suited to a given set of data or numbers, we have to determine the θ, i.e. the parameters, with the highest likelihood for the data.

This estimation of the parameters having the highest likelihood value for a given set of data is called maximum likelihood estimation.

The two common steps involved in doing MLE are:

  1. We know the relation L(θ|X) = f(X|θ), so we first write the joint probability density (or joint probability mass) function of the data under the distribution.

  2. We then find the value of θ, i.e. the parameters, for which this joint probability distribution is maximum, using calculus (to compute the maximizing argument).

Let's do the maximum likelihood estimation of a general normal (Gaussian) distribution with mean μ and standard deviation σ.

So θ = {μ, σ}, and X = {x1, x2, x3, …, xn}.

The probability density function of a generic term of the sequence is

f(xi|μ, σ) = (1 / (σ√(2π))) * exp(−(xi − μ)² / (2σ²))

Suppose we have n i.i.d. observations x1, x2, …, xn. Because they are independent, we know that for independent random variables the joint probability distribution is the product of the individual distributions, i.e. f(x1, x2, …, xn) = f(x1) * f(x2) * … * f(xn).

So the joint probability distribution of the data is the product of the normal density at every data point:

f(x1, …, xn|μ, σ) = ∏ (1 / (σ√(2π))) * exp(−(xi − μ)² / (2σ²)), the product running over i = 1, …, n.

Now we have to maximize this and find the θ = {μ, σ} at which the maximum occurs.

Since the product form of the equation is hard to handle and differentiate, we take the log of both sides:

log f(x1, …, xn|μ, σ) = −(n/2) log(2π) − n log(σ) − (1 / (2σ²)) Σ (xi − μ)²

To find the μ and σ that maximize this function, we partially differentiate it with respect to μ and σ respectively and set each derivative to zero.

Let's call log(f(x1, x2, …, xn|σ, μ)) L. Then let:

∂L/∂μ = (1/σ²) Σ (xi − μ) = 0

Because σ² is strictly larger than zero, solving this equation gives

μ̂ = (1/n) Σ xi

i.e. the sample mean.

Similarly, let

∂L/∂σ = −n/σ + (1/σ³) Σ (xi − μ)² = 0

Solving, we get:

σ̂² = (1/n) Σ (xi − μ̂)²

So we have the values of μ and σ that are the best estimates for our data.

This method can be applied to any distribution and any data to find the set of parameters that best fits the data.
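A short numerical sketch of these closed-form estimates, using synthetic data generated with assumed true parameters μ = 2 and σ = 1.5:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic data, assumed truth

# MLE for a Gaussian, straight from the derivation above
mu_hat = np.sum(x) / len(x)                      # (1/n) * sum(x_i)
sigma2_hat = np.sum((x - mu_hat) ** 2) / len(x)  # (1/n) * sum((x_i - mu_hat)^2)

print(f"mu_hat    = {mu_hat:.3f}")               # close to 2.0
print(f"sigma_hat = {np.sqrt(sigma2_hat):.3f}")  # close to 1.5
# Note: the MLE of the variance divides by n, not n - 1, so it is slightly biased.
```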

We generally calculate the MLE assuming the data to be i.i.d. (independent and identically distributed), so all the above steps are followed:

  1. The joint probability distribution is written in terms of the data and the parameters.
  2. Assuming i.i.d. data, it is expressed as a product of the likelihood/probability function of every data point.
  3. To convert it into a sum, the log of the likelihood/probability function is taken.
  4. Partial differentiation is done with respect to every parameter to find a suitable estimate (a numerical sketch of this recipe follows below).
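When the calculus is not tractable by hand, the same recipe can be run numerically. Below is a minimal sketch, assuming scipy is available, that maximizes the Gaussian log-likelihood by minimizing its negative with scipy.optimize.minimize; the log-pdf of any other distribution could be swapped in:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)   # synthetic data

def neg_log_likelihood(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # close to 2.0 and 1.5
```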

Maximum A Posteriori (MAP) Estimation

In MLE, to find the parameter θ most suitable for the data X, we find the θ that maximizes the likelihood L(θ|X) (or, equivalently, P(X|θ)).

In the real world, or in some machine learning cases, we should not determine θ based solely on its dependence on X (as the likelihood does). Sometimes we prefer to maximize the posterior instead of the likelihood, because the posterior function combines the likelihood with the chance of θ occurring in the real world, i.e. the prior, which makes it more suitable for estimation. Finding the parameter by maximizing the posterior is called Maximum A Posteriori (MAP) estimation.

For a dataset X, we can find the best estimated parameter θ by MAP as:

θ_MAP = argmax over θ of P(θ|X) = argmax over θ of P(X|θ) * P(θ) / P(X)

We know that P(X) can be calculated by integrating (or summing) over X occurring with all possible values of θ, as in the generalized form of Bayes' theorem explained earlier.

Just as P(B) above depends on all the values Aj (A1, A2, …) and not on the particular Ai in P(Ai|B), so for P(θi|X) the term P(X) depends on all values of θ and not on the particular θi.

So P(X) is constant with respect to any particular θ. Therefore, while finding the θ that maximizes P(θ|X), we can simply drop P(X) from the denominator to make the computation easier to solve:

θ_MAP = argmax over θ of P(X|θ) * P(θ)

Again, we can write the joint probability for the likelihood function, multiply it by the prior, take the log, and do partial differentiation to get the estimated θ.
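As a sketch under simple assumptions: estimate the mean of a Gaussian with known σ = 1, using an assumed Gaussian prior μ ~ N(0, 1) on the mean. Taking the log of likelihood × prior and setting the derivative to zero gives the closed form used below, which is compared with the plain MLE:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                                    # known noise std, assumed
x = rng.normal(loc=0.8, scale=sigma, size=20)  # small synthetic sample

# MLE: maximize the likelihood alone
mu_mle = np.mean(x)

# MAP: maximize likelihood * prior, with an assumed prior mu ~ N(mu0, tau^2)
mu0, tau = 0.0, 1.0
n = len(x)
mu_map = (tau**2 * np.sum(x) + sigma**2 * mu0) / (n * tau**2 + sigma**2)

print(f"mu_mle = {mu_mle:.3f}")   # fits the data alone
print(f"mu_map = {mu_map:.3f}")   # shrunk toward the prior mean 0
```

With only 20 points the prior pulls the MAP estimate noticeably toward 0; with more data the two estimates converge.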

Can you do it for a Gaussian distribution?

Well, obviously you can try. If you do, congratulations: you have already cleared one of the most fundamental and important concepts needed for machine learning, one that most ML learners struggle with and that is essential for learning advanced ML.

Thanks everyone for reading. Please like, follow, share and subscribe.

Ping me on +918292098293 and follow me on Twitter: https://twitter.com/AdityaR71244890?s=08

Please see my other blogs for studying machine learning from absolute scratch.
