Basics of Calculus for Machine Learning

Aditya Raj
6 min read · Aug 31, 2021

Calculus is the soul of “learning” in machine learning. From minimizing the loss between actual and predicted values in machine and deep learning to optimization in general, calculus is what makes all of it work.

Again, calculus is a very deep field of mathematics, and we will study only what is required for machine learning.

Let’s start with what you already know from high school.

Differentiation of a Univariate Function

The difference quotient of a univariate function y = f(x) is defined as the slope of the secant line through the graph of f between x and x + h:

δy/δx := (f(x + h) − f(x)) / h

It represents the slope of the secant line between the two points (x, f(x)) and (x + h, f(x + h)).

The derivative of f is obtained from the difference quotient by letting h > 0 tend to zero; in the limit h → 0 the secant becomes the tangent, so the derivative is the slope of the tangent line:

df/dx := lim(h → 0) (f(x + h) − f(x)) / h

The secant line from above now becomes the tangent to the graph at x.
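To see the limit at work numerically, here is a minimal Python sketch; the choices f(x) = x² and x = 3 are illustrative assumptions, not taken from the text. As h shrinks, the difference quotient approaches the tangent slope f'(3) = 6.

```python
# Difference quotient (f(x + h) - f(x)) / h for f(x) = x**2 at x = 3,
# where the true derivative is f'(3) = 2 * 3 = 6 (illustrative choice).
def f(x):
    return x ** 2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    secant_slope = (f(x + h) - f(x)) / h
    print(f"h = {h:<6} difference quotient = {secant_slope:.4f}")
# The printed values approach 6.0 as h -> 0.
```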

Let’s look at an example: finding the derivative of y = x^n w.r.t. x with this basic definition. Expanding (x + h)^n with the binomial theorem,

dy/dx = lim(h → 0) ((x + h)^n − x^n) / h = lim(h → 0) (x^n + n·x^(n−1)·h + (terms in h² and higher) − x^n) / h = lim(h → 0) (n·x^(n−1) + (terms in h and higher)) = n·x^(n−1)
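If you want to verify this symbolically, here is a small sketch using sympy (assuming it is installed); it checks the power rule for a concrete exponent and for a general one.

```python
import sympy as sp

x = sp.symbols('x')
n = sp.symbols('n', positive=True)

# Concrete case: d/dx x**4 = 4*x**3
print(sp.diff(x ** 4, x))                                  # 4*x**3

# General power rule: d/dx x**n minus n*x**(n - 1) simplifies to zero
print(sp.simplify(sp.diff(x ** n, x) - n * x ** (n - 1)))  # 0
```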

Taylor series

The Taylor series is a representation of a function f as an infinite sum of terms determined from the derivatives of f evaluated at a single point x0.

The Taylor polynomial of degree n of f : R → R at x0 is defined as

Tn(x) := Σ (k = 0 to n) f^(k)(x0)/k! · (x − x0)^k

where f^(k)(x0) is the kth derivative of f at x0 and the f^(k)(x0)/k! are the coefficients of the polynomial. Letting n go to infinity gives the Taylor series.

Let’s understand it better by looking at the Taylor series expansion of the function y = x⁴ at x0 = 1 up to n = 6. Let’s calculate the derivatives first:

f(1) = 1, f'(x) = 4x³ so f'(1) = 4, f''(x) = 12x² so f''(1) = 12, f'''(x) = 24x so f'''(1) = 24, f''''(x) = 24 so f''''(1) = 24, and f^(5)(x) = f^(6)(x) = 0.

Now let’s put these into the formula to get the Taylor series representation:

T6(x) = 1 + 4(x − 1) + 6(x − 1)² + 4(x − 1)³ + (x − 1)⁴

So this is the Taylor series representation of x⁴, which can be expanded to get the same function back.
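As a quick check, a minimal sympy sketch (assuming sympy is available) reproduces this expansion and confirms that it collapses back to x⁴:

```python
import sympy as sp

x = sp.symbols('x')
f = x ** 4

# Series expansion around x0 = 1; n = 7 keeps terms up to (x - 1)**6
taylor = f.series(x, 1, 7).removeO()
print(taylor)               # 1 + 4*(x - 1) + 6*(x - 1)**2 + 4*(x - 1)**3 + (x - 1)**4 (term order may vary)
print(sp.expand(taylor))    # x**4
```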

Rules of Differentiation

These are the basic rules of differentiation:

Sum rule: (f(x) + g(x))' = f'(x) + g'(x)
Product rule: (f(x)g(x))' = f'(x)g(x) + f(x)g'(x)
Quotient rule: (f(x)/g(x))' = (f'(x)g(x) − f(x)g'(x)) / (g(x))²
Chain rule: (g(f(x)))' = g'(f(x)) · f'(x)

These are rules you are all familiar with, so let’s just revise with an example of differentiation using the chain rule.
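Here is a small worked sketch in sympy; the composite sin(x²) is just an illustrative choice (an assumption for the example). The chain rule gives d/dx sin(x²) = cos(x²) · 2x, and sympy confirms it.

```python
import sympy as sp

x = sp.symbols('x')

# Chain rule by hand for h(x) = sin(x**2): outer derivative evaluated at the
# inner function, times the derivative of the inner function.
inner = x ** 2
by_hand = sp.cos(inner) * sp.diff(inner, x)   # cos(x**2) * 2*x

automatic = sp.diff(sp.sin(x ** 2), x)        # 2*x*cos(x**2)
print(automatic)
print(sp.simplify(by_hand - automatic))       # 0 -- the two results agree
```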

That should do as a recap (and maybe even something new for some of you). Let’s look at the multivariate case now.

The Multivariate Case

For a function of a scalar variable x ∈ R, we find the derivative by simple differentiation. In the multivariate case, where the function f depends on one or more variables x ∈ Rn, e.g., f(x) = f(x1, x2), we generalize the notion of the derivative to the gradient.

More precisely, we find the gradient of the function f with respect to x by varying one variable at a time and keeping the others constant; this is called partial differentiation, and the resulting derivatives are the partial derivatives ∂f/∂x1, …, ∂f/∂xn.

The gradient is then the collection of all these partial derivatives in a row vector:

∇x f = df/dx = [∂f/∂x1  ∂f/∂x2  …  ∂f/∂xn] ∈ R^(1×n)

Let’s again look at an example: for f(x1, x2) = x1²x2 + x1x2³ ∈ R, the partial derivatives (i.e., the derivatives of f with respect to x1 and x2) are

∂f/∂x1 = 2x1x2 + x2³ and ∂f/∂x2 = x1² + 3x1x2²

and the gradient is then

df/dx = [∂f/∂x1  ∂f/∂x2] = [2x1x2 + x2³  x1² + 3x1x2²] ∈ R^(1×2)
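A minimal sympy sketch (assuming sympy is installed) reproduces both partial derivatives and stacks them into the gradient row vector:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1 ** 2 * x2 + x1 * x2 ** 3

# Vary one variable at a time, keeping the other constant
df_dx1 = sp.diff(f, x1)     # 2*x1*x2 + x2**3
df_dx2 = sp.diff(f, x2)     # x1**2 + 3*x1*x2**2

# Gradient as a 1 x 2 row vector of the partial derivatives
gradient = sp.Matrix([[df_dx1, df_dx2]])
print(gradient)             # Matrix([[2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]])
```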

Rules of Partial Differentiation

The sum and product rules:

∂/∂x (f(x) + g(x)) = ∂f/∂x + ∂g/∂x
∂/∂x (f(x)g(x)) = (∂f/∂x) g(x) + f(x) (∂g/∂x)

The chain rule:

Since we are in the multivariate setting, consider a function f : R2 → R of two variables x1, x2, where x1(t) and x2(t) are themselves functions of t. To compute the gradient of f with respect to t, we apply the chain rule for multivariate functions:

df/dt = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt

Let’s look at an example: f(x1, x2) = x1² + 2x2 with x1 = sin t and x2 = cos t. Then

df/dt = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt = 2x1 · cos t + 2 · (−sin t) = 2 sin t cos t − 2 sin t
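A short sympy sketch confirms this, both by the chain rule and by substituting x1(t), x2(t) into f and differentiating directly (again assuming sympy is available):

```python
import sympy as sp

t = sp.symbols('t')
x1_t, x2_t = sp.sin(t), sp.cos(t)

# Direct approach: substitute x1(t), x2(t) and differentiate with respect to t
f_t = x1_t ** 2 + 2 * x2_t
direct = sp.diff(f_t, t)                      # 2*sin(t)*cos(t) - 2*sin(t) (term order may differ)

# Chain-rule approach: df/dt = (df/dx1) * dx1/dt + (df/dx2) * dx2/dt
x1, x2 = sp.symbols('x1 x2')
f = x1 ** 2 + 2 * x2
chain = (sp.diff(f, x1).subs({x1: x1_t, x2: x2_t}) * sp.diff(x1_t, t)
         + sp.diff(f, x2).subs({x1: x1_t, x2: x2_t}) * sp.diff(x2_t, t))

print(direct)
print(sp.simplify(chain - direct))            # 0 -- both approaches agree
```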

Let’s raise the complexity a little higher: suppose f(x1, x2) is a multivariate function of two variables, where x1(s, t) and x2(s, t) are themselves functions of s and t. What is the derivative now?

Well, we can easily find the partial derivatives of f w.r.t. s and t separately:

∂f/∂s = ∂f/∂x1 · ∂x1/∂s + ∂f/∂x2 · ∂x2/∂s
∂f/∂t = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t

and the whole gradient is obtained with the chain-rule formula and a matrix multiplication: df/d(s,t) = ∂f/∂x · ∂x/∂(s,t), where ∂f/∂x = [∂f/∂x1  ∂f/∂x2] is a row vector and ∂x/∂(s,t) is the 2×2 matrix with rows [∂x1/∂s  ∂x1/∂t] and [∂x2/∂s  ∂x2/∂t].
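Here is a sympy sketch of that matrix multiplication. The concrete choices f = x1²x2, x1(s, t) = s·t and x2(s, t) = s + t are purely illustrative assumptions; the point is that the row vector ∂f/∂x times the matrix ∂x/∂(s, t) matches direct differentiation.

```python
import sympy as sp

s, t, x1, x2 = sp.symbols('s t x1 x2')

# Illustrative choices (assumptions for the example)
f = x1 ** 2 * x2
x1_st = s * t
x2_st = s + t
subs = {x1: x1_st, x2: x2_st}

# Row vector df/dx and 2x2 matrix dx/d(s,t)
df_dx = sp.Matrix([[sp.diff(f, x1), sp.diff(f, x2)]]).subs(subs)
dx_dst = sp.Matrix([[sp.diff(x1_st, s), sp.diff(x1_st, t)],
                    [sp.diff(x2_st, s), sp.diff(x2_st, t)]])
grad_st = df_dx * dx_dst        # 1x2 row vector [df/ds, df/dt]

# Sanity check: substitute first, then differentiate directly
f_st = f.subs(subs)
direct = sp.Matrix([[sp.diff(f_st, s), sp.diff(f_st, t)]])
print((grad_st - direct).applyfunc(sp.simplify))    # Matrix([[0, 0]])
```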

Jacobian

The name may sound like some high-level quantum-computation concept if you haven’t heard it before, but it is simply the matrix that collects the gradients of a vector-valued function, i.e., a matrix of first-order derivatives.

Let’s understand how we get it and work with it, starting from what is called a vector-valued function. For a function f : Rn → Rm and a vector x = [x1, …, xn]^T ∈ Rn, the corresponding vector of function values is given as

f(x) = [f1(x), f2(x), …, fm(x)]^T ∈ Rm

This is a vector-valued function: each component fi is itself a function of x.

Therefore, the partial derivative of a vector-valued function f : Rn → Rm with respect to xi ∈ R, i = 1, …, n, is given as the column vector

∂f/∂xi = [∂f1/∂xi, …, ∂fm/∂xi]^T ∈ Rm

Here each fj is itself a multivariate function of x = (x1, x2, …, xn), so its derivative with respect to the whole vector x is simply the gradient we studied above:

df1/dx = [∂f1/∂x1  ∂f1/∂x2  …  ∂f1/∂xn]

So for the whole vector-valued function f = [f1, f2, …, fm]^T, we can stack these gradients as rows: df/dx is the m × n matrix whose ith row is dfi/dx = [∂fi/∂x1  …  ∂fi/∂xn].

We can now define the Jacobian: the collection of all first-order partial derivatives of a vector-valued function f : Rn → Rm is called the Jacobian. The Jacobian J is exactly the m × n matrix above,

J = ∇x f = df/dx, with entries J(i, j) = ∂fi/∂xj,

where the ith row is the gradient of fi and the jth column is the partial derivative of f with respect to xj.
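A minimal sympy sketch shows this for a small illustrative f : R² → R² (the component functions are assumptions for the example, reusing the earlier gradient example as the first component):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Illustrative vector-valued function f : R^2 -> R^2 (an assumption for the example)
f = sp.Matrix([x1 ** 2 * x2 + x1 * x2 ** 3,
               sp.sin(x1) + x2])

# 2 x 2 Jacobian: row i is the gradient of f_i, entry (i, j) is d f_i / d x_j
J = f.jacobian([x1, x2])
print(J)
# Matrix([[2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2], [cos(x1), 1]])
```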

Hessian

Just as the Jacobian collects the first-order derivatives, the Hessian collects the second-order derivatives.

Consider a function f : R2 → R of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients): ∂²f/∂x² is the second partial derivative of f with respect to x, and ∂²f/∂x∂y is the mixed partial derivative obtained by differentiating first with respect to y and then with respect to x (for smooth f the order does not matter).

The Hessian is the (symmetric) matrix of these second-order partial derivatives, with rows [∂²f/∂x²  ∂²f/∂x∂y] and [∂²f/∂y∂x  ∂²f/∂y²], and can be denoted as ∇²x,y f(x, y).
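In sympy this is available via sympy.hessian; the function f = x³y + exp(xy) below is just an illustrative assumption:

```python
import sympy as sp

x, y = sp.symbols('x y')

# Illustrative scalar function of two variables (an assumption for the example)
f = x ** 3 * y + sp.exp(x * y)

# 2 x 2 symmetric matrix of second-order partial derivatives
H = sp.hessian(f, (x, y))
print(H)                    # entries: d2f/dx2, d2f/dxdy, d2f/dydx, d2f/dy2
```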

These calculus concepts are sufficient to understand how machine learning models and their optimization work; most of the calculus used in ML builds on them.

I was keen to also cover things like the gradient of a vector w.r.t. a matrix and the gradient of a matrix w.r.t. a matrix, but they are just extensions of these concepts, and covering them in a blog post could be confusing; I will try to cover them in a video lecture.

Things like backpropagation and automatic differentiation are also very important; they too are built on these concepts and will be covered when we study their use in machine learning and deep learning. Again, I will try to cover them in a lecture releasing soon.

Thanks for reading. Please like, share, and subscribe if you enjoyed it. For any doubts, email me at: adityaraj20008@gmail.com

Follow me on Twitter: https://twitter.com/AdityaR71244890?s=08 and on LinkedIn: https://www.linkedin.com/in/aditya-raj-553322197.
