Machine Learning Math Tutorial

By Sajad Darabi

October 6, 2017

If you put yourself in a position where you have to stretch outside your comfort zone, then you are forced to expand your consciousness - Les Brown


Machine learning is multidisciplinary, and it is useful to be familiar with

- Sets and Types
- Probability
- Linear Algebra
- Optimization

You can get away without knowing much of the math by using the online libraries, but knowing the math will give you an intuition for why things work the way they do.

Note: I like working things out from the ground up, so sorry if you think it's too basic at times! I will go quickly over the easy concepts.

Sets and Types

A set is a collection of items or things and we use curly braces to denote set objects.

$$A = \{3, 6, 9\}$$ $$B = \{dog, cat\}$$

The set $A$ consists of three elements, namely $3$, $6$, and $9$.

Sets you may already be familiar with are the natural numbers and the real numbers, $$\mathbb{N}$$ $$\mathbb{R}$$

We can use set builder notation to build sets:

$$\{x ~|~ x \in \mathbb{Z}\}$$

In plain words: $x$ such that $x$ belongs to the integers.


$$A = \{3, 6, 9\}$$

The cardinality of a set $|A|$ is the number of elements in that set.

$$|A|~ \text{is}~3$$

The cardinality of $\mathbb{N}$ is infinite and is denoted $\aleph_0$ (aleph-null).

Some special sets

The universal set $\Omega$ is the set that includes all other sets in your space of events.

For every event $A$ $$A \subseteq \Omega$$

The null set $\phi$ is contained in every other set.

For every event $A$

$$\phi \subseteq A$$

The power set $$2^\Omega$$

consists of all subsets of the universal set $\Omega$.


Below is a set of operators that are used with sets. Say we have the following sets $A = \{3, 6, 9\}$ $B = \{3, 6, 7\}$, and $C = \{3, 6, 7, 9\}$.

$$ \text{Union of two sets: }~ A\cup B = \{3, 6, 7, 9\}$$ $$ \text{Intersection of two sets: }~ A\cap B = \{3, 6\}$$ $$ \text{Subset, superset: }~ B \subset C$$ $$ \text{Complement: } A^c ~\text{is the set of all elements not in}~ A $$

Two sets are equal if and only if they contain exactly the same elements.

$$A \neq C$$

Properties of set operations

Commutative law: $$A \cup B = B \cup A$$ $$A \cap B = B \cap A$$ Associative law: $$A \cup (B \cup C) = (A \cup B) \cup C$$ Distributive law: $$(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$$ $$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$$

De Morgan's laws: $$(A \cup B)^c = A^c \cap B^c$$ $$(A \cap B)^c = A^c \cup B^c$$

What is the cardinality of $A \cup B$?

In [1]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# two example sets for the Venn diagram
A = [1, 2, 3, 4]
B = [1, 3, 5, 7]
venn2([set(A), set(B)])
plt.show()

$$|A \cup B| = |A| + |B| - |A \cap B|$$
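The inclusion-exclusion identity can be checked quickly in Python with the same sets used in the Venn-diagram example above:

```python
A = {1, 2, 3, 4}
B = {1, 3, 5, 7}

# |A ∪ B| computed directly...
lhs = len(A | B)
# ...and via inclusion-exclusion: |A| + |B| - |A ∩ B|
rhs = len(A) + len(B) - len(A & B)

print(lhs, rhs)  # both are 6
```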


Determinism - a theory stating that all events are determined by previous causes.

This is not enough to describe our world! For example, weather patterns, or physical systems at the atomic level.

To be able to describe certain events we need randomness.

Random events are events that have no deterministic causes and occur in an indeterministic way.

The tool that deals with randomness is probability.

Measure Theory (skip)

Probability theory is a branch of measure theory. Axiomatic probability theory is based on a family of subsets with certain properties. Such a family of subsets is called a $\sigma$-field.

A collection $F = \{A_1, A_2, ...\}$ of subsets is a $\sigma$-field (or $\sigma$-algebra) if it satisfies the following.

  1. $\phi \in F$
  2. if $A_i \in F$ then $A_i^c \in F$
  3. if $A_1, A_2, ... \in F$ then $$\bigcup_{i = 1, 2,...} A_i \in F$$

$\sigma$-algebra and Measurable Space (skip)

Some examples of $\sigma$-algebra are

  1. The trivial $\sigma$ -field $\{\Omega, \phi\}$
  2. Given $\Omega = \{1, 2, 3, 4, 5, 6\}$ an example of a $\sigma$-field is $$\{\Omega, \phi, \{1, 3, 5\}, \{2, 4, 6\}\}$$

The pair $\{\Omega, F\}$ is called a measurable space. Note $F$ is a $\sigma$-algebra.

A real set function $$\mu: F \rightarrow \mathbb{R}^+$$ is called a measure if $\mu(\phi) = 0$ and, for disjoint sets $A_i$, $$\mu(\bigcup_{i = 1, 2,...} A_i) = \sum_{i = 1, 2,...}\mu(A_i)$$

Measure Space and Probability Space (skip)

Say we have a universal set $\Omega$, a $\sigma$-field $F$ of subsets of $\Omega$, and a measure $\mu$ defined on the pair $\{\Omega, F\}$; then the triple $$\{\Omega, F, \mu\}$$

is a measure space.

If we constrain the measure to have the following property:

$$\mu(\Omega) = 1$$

then it is a probability measure.

Say we have a die with 6 faces.

Our universal set is $$\{1, 2, 3, 4, 5, 6\}$$

Say we have a $$\sigma\text{-algebra} = \{\Omega, \phi, \{1, 3, 5\}, \{2, 4, 6\}\}$$

Then we can define a probability measure $\mu$

$$\mu(\Omega) = 1$$

$$\mu(\{1, 3, 5\}) = \frac{1}{3}$$

$$\mu(\{2, 4, 6\}) = \frac{2}{3}$$

Are we missing anything?

The event when none of the faces happen!

$$\mu(\phi) = 0$$

Interpreting Probability

Two common interpretations of probability are the frequentist interpretation and the Bayesian interpretation.

Frequentist Interpretation

Assume a random experiment is performed $N_{x}$ times and an event $B$ occurs $N_{c}$ of those times.

The frequentist then assigns the probability of event $B$ occurring as

$$P(B) = \lim_{N_x \rightarrow \infty}\frac{N_{c}}{N_{x}}$$

A disadvantage of this is that the experiment needs to be performed repeatedly.

What is the probability of being struck by lightning? You have to run outside every time there is a thunderstorm and see whether you get struck by lightning or not. It's really hard to do in LA!
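For an experiment that is easy to repeat, the frequentist recipe can be sketched with the standard library (a simulated fair coin; the seed and number of trials are arbitrary choices):

```python
import random

random.seed(0)

N_x = 100_000                        # number of experiments performed
N_c = sum(random.random() < 0.5      # count occurrences of the event "heads"
          for _ in range(N_x))

P_heads = N_c / N_x                  # relative frequency N_c / N_x
print(P_heads)                       # close to 0.5 for large N_x
```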

Bayesian Interpretation

In Bayesian probability, instead of using the frequency of an event as a measure, we use a reasonable expectation for that event to occur given information about the context.

So in Bayesian inference a hypothesis is given a certain probability based on prior knowledge (the prior probability), which is updated when new evidence/knowledge is obtained (the posterior probability).

Axiomatic Definition of Probability

  1. $P(A) \geq 0$
  2. For mutually exclusive events $A$ and $B$ we have $P(A \cup B) = P(A) + P(B)$
  3. $P(\Omega) = 1$

Some properties:

  1. $P(\phi) = 0$
  2. $A \subset B \subset \Omega \rightarrow P(A) \leq P(B)$
  3. $A \subseteq \Omega \rightarrow P(A) \leq 1$
  4. $P(A^c) = 1 - P(A)$
  5. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Let $A$ and $B$ be events with probabilities $P(A) = \frac{3}{4}$ and $P(B) = \frac{1}{3}$. Show that

$$\frac{1}{12} \leq P(A \cap B) \leq \frac{1}{3}$$

$$P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq \frac{1}{3} + \frac{3}{4} - 1 = \frac{1}{12}~~\text{since}~P(A \cup B) \leq 1$$

$$ A \cap B \subseteq A, ~A \cap B \subseteq B \rightarrow P(A \cap B) \leq \min(P(A), P(B)) = \frac{1}{3}$$

Conditional Probability

The probability of an event given that another event has occurred. It is defined as

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

The way to think of this: say we have $$\Omega = \{1, 2, 3, 4, 5, 6\}$$ and we want to know the probability that the event $A = \{3\}$ occurred given that the event odd, $\{1, 3, 5\}$, happened.

We know that the event odd happened so now we operate in the sample set $\{1, 3, 5\}$. If we assume that each number is equally likely then,

$$P(A|~\text{odd}) = \frac{1}{3}$$

Using the formula above, $P(A|~\text{odd}) = \frac{\frac{1}{6}}{\frac{1}{2}} = \frac{1}{3}$
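The same number can be estimated by simulation (assuming a fair die; the seed and sample size are arbitrary): keep only the rolls where the conditioning event "odd" occurred, then count how often $3$ appears among them.

```python
import random

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(100_000)]

# condition on the event "odd": restrict to the reduced sample space
odd_rolls = [r for r in rolls if r % 2 == 1]
p_est = sum(r == 3 for r in odd_rolls) / len(odd_rolls)

print(p_est)  # close to 1/3
```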

Independence and Conditional Independence

Two events are independent if the occurrence of one does not affect the other.

$$P(A \cap B) = P(A)\cdot P(B)$$

For example, with two fair coin tosses the probability of getting two heads is

$$P(H) \cdot P(H) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

Two events $A$ and $B$ are conditionally independent given a third event $C$ if

$$P(A \cap B ~|~ C) = P(A ~|~ C) \cdot P(B ~|~ C)$$

Random Variables

Random variables are functions that map outcomes to real valued numbers.

For example, take the previous example of rolling a die, with the possible outcomes of an odd number ($\{1, 3, 5\}$) or an even number ($\{2, 4, 6\}$).

Then a random variable can be

$$X(\{1, 3, 5\}) = 1$$ $$X(\{2, 4, 6\}) = 20$$

If we assume each face has equal probability, and take the probability of the die landing on none of the faces to be $0$, then

$$P(X = 1) = \frac{1}{2}$$

$$P(X = 20) = \frac{1}{2}$$


We also define the distribution of a random variable. This refers to the probability measure of a particular outcome (the random variable maps events to real-valued numbers!) in your universe.

For a fair die, taking $X$ to be the face value, the discrete probability function is given by a uniform distribution $$P(X = x) = \begin{cases} 0 & \text{for}~x \notin\Omega \\ \frac{1}{|\Omega|} & \text{for}~x \in\Omega \end{cases}$$

so we have $P(X = a) = \frac{1}{6}$ for each face $a$.

Joint and Marginal Distributions

Let an experiment consist of $2$ dice throws, one die orange and the other green (we are treating them as two different dice). Then the set of all possible outcomes consists of all pairs $\{(1, 1), (1, 2), ...\}$.

If we assume fair dice, the joint probability mass function (PMF) is $$P(X, Y) = \frac{1}{36}$$

Say we don't care what the orange die roll was; then the marginal is given by

$$P(X) = \sum_{y \in Y}{P(X, y)} = \frac{1}{6}$$

Note: this is a discrete probability distribution so we use summation; for continuous variables the distribution is called a probability density function (PDF).
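Marginalization can be sketched with numpy: build the $6 \times 6$ joint PMF of the two dice and sum out the orange die's outcomes.

```python
import numpy as np

# joint PMF P(X, Y) of two fair dice: every pair has probability 1/36
joint = np.full((6, 6), 1 / 36)

# marginal of the green die: sum over the orange die's outcomes
marginal = joint.sum(axis=1)

print(marginal)  # [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
```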


Consider a discrete random variable $X$ with values $x = a_1, a_2, ...$ and a probability distribution $P(X)$.

The expected value of $X$ is defined as

$$E[X] = \sum_i{P(X = a_i})\cdot a_i$$

For a fair die with a uniform distribution (again dice..!)

$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

For a continuous variable

$$E[X] = \int x\cdot p(x)dx$$

Mean and Variance

The mean of a random variable is defined as its expected value

$$\mu = E[X]$$

and the variance $Var(X)$ is the expected value of the random variable $ Y = (X - \mu)^2$

$$\sigma^2 = E[(X - \mu)^2]$$

Some properties of Expected value:

  1. $E[aX] = aE[X]$
  2. $E[X + Y] = E[X] + E[Y]$
  3. $E[XY] = E[X]E[Y]$ if $X$ and $Y$ are independent, i.e. $P(X, Y) = P(X)P(Y)$ (the converse does not hold in general)

What is the expected value of a random variable $X$ with $x \in \{1, 10\}$ and $X \sim \text{Bernoulli}(p = 0.5)$?

$$E[X] = 1 \cdot 0.5 + 10 \cdot 0.5 = 5.5$$

Covariance and Correlation

The covariance of a joint distribution $p(x, y)$ measures the tendency for $x$ and $y$ to deviate from their means together.

$$cov(x, y) = E[(x-\mu_x)(y-\mu_y)]$$

Correlation is a standardized covariance, hence its range is limited to $[-1, 1]$

$$corr(x, y) = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y}$$
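A quick numpy sketch (the linear relation $y = 2x + \text{noise}$ and the seed are arbitrary illustrative choices): the empirical covariance should land near $2$, and the correlation near $2/\sqrt{5} \approx 0.894$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)   # y tends to deviate along with x

# covariance and correlation straight from their definitions
cov = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov / (x.std() * y.std())

print(cov, corr)   # cov ≈ 2, corr ≈ 0.894
```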

Sum of Random Variables

If we have a sequence of random variables $X_1, X_2, ..., X_n$ and Y is their sum $$Y = \sum_i^nX_i$$

Find its mean and variance if the $X_i$ are statistically independent.

$$E[Y] = E[X_1] + E[X_2] + ... + E[X_n]$$

$$Var(Y) = E[\sum_i^n(X_i - E[X_i])\sum_j^n(X_j - E[X_j])]$$ $$=\sum_i^n Var(X_i) + \sum_i^n\sum_{j \neq i}^n Cov(X_i, X_j)$$

Since the $X_i$ are independent, the covariance terms vanish and $Var(Y) = \sum_i^n Var(X_i)$.

Sample Mean

If a set of experiments is performed repeatedly, each with random variable $X_i$, and the $X_i$ are independent and identically distributed (iid), the sample mean is given by

$$M_n = \frac{1}{n}\sum_i^nX_i$$

The above is an estimator for the true mean of $X$.

We say it is an unbiased estimator if $$E[M_n] = \mu$$ the true mean of $X$.

Show that $M_n = \frac{1}{n}\sum_i^nX_i$ is an unbiased estimator

$$E[M_n] = E[\frac{1}{n}\sum_i^nX_i] = \frac{1}{n}E[\sum_i^nX_i] = \frac{1}{n}\cdot n \cdot E[X_i] = \mu$$

Law of Large Numbers

States that after a large enough number of experiments the sample mean will approach the true mean.

$$\lim_{n\rightarrow \infty} P[|M_n - \mu| < \epsilon] = 1$$

For the strong law of large numbers, convergence holds almost surely:

$$P[\lim_{n\rightarrow \infty} M_n = \mu] = 1$$
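A quick simulation of the law for fair-die rolls, where the true mean is $\mu = 3.5$ (the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)   # simulated fair-die rolls

for n in [10, 1_000, 1_000_000]:
    print(n, rolls[:n].mean())   # sample mean M_n drifts toward 3.5
```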

Central Limit Theorem

If we have a sequence of iid random variables $X_1, X_2, ...$ with finite mean $\mu$ and finite variance $\sigma^2$, let $S_n$ be the sum of the first $n$ random variables in the sequence

$$S_n = X_1 + X_2 + ... + X_n$$

A new random variable with zero mean and unit variance is defined as

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

Then the central limit theorem states that as $n$ approaches $\infty$, the CDF of $Z_n$ approaches that of a standard normal:

$$F_{Z_n}(z) \rightarrow \frac{1}{\sqrt{2\pi}}\int_{-\infty}^z e^{-y^2/2}dy$$
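A sketch of the theorem for die rolls (here $\mu = 3.5$ and $\sigma^2 = 35/12$; the seed, $n$, and number of trials are arbitrary): the standardized sums $Z_n$ behave like a standard normal, so roughly 95% of them fall within $\pm 1.96$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 5_000
mu, sigma = 3.5, np.sqrt(35 / 12)        # mean and std of a single die roll

# sums S_n of n die rolls, repeated over many trials, then standardized
S_n = rng.integers(1, 7, size=(trials, n)).sum(axis=1)
Z_n = (S_n - n * mu) / (sigma * np.sqrt(n))

print(Z_n.mean(), Z_n.std())             # ≈ 0 and ≈ 1
print(np.mean(np.abs(Z_n) < 1.96))       # ≈ 0.95, as for a standard normal
```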



Derivatives

Refer to a table of derivatives for a list of the ones you should be familiar with.

Once you do that take the derivative of

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
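The identity can be checked numerically with a central finite difference (the point $x = 0.7$ and the step size $h$ are arbitrary choices):

```python
import math

def sigma(x):
    return 1 / (1 + math.exp(-x))

x, h = 0.7, 1e-6
# finite-difference approximation of the derivative
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
# the claimed closed form
analytic = sigma(x) * (1 - sigma(x))

print(numeric, analytic)  # agree to many decimal places
```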

Chain Rule

$$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$

Say $f(x) = x^2$, and take the derivative of $$\sigma(f(x))$$

$$\frac{d}{dx}\sigma(f(x)) = \sigma(f(x))(1 - \sigma(f(x))) \cdot 2x$$

One more example

Let $z = (1 + e^x) \cdot 2x$ and $y = z^2$, and take the derivative of

$$F(y) = e^y$$ with respect to x.

Partial Derivative

Say we have

$$f(x, y) = 2x + 30xy + x^y$$

Take the partial of $f$ with respect to $x$

$$\frac{\partial f}{\partial x} = 2 + 30y + y x ^ {y - 1}$$

When you take a partial derivative you treat everything as constant besides the variable you are differentiating with respect to.
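A finite-difference check of the partial derivative above (the evaluation point $(x, y) = (2, 3)$ is an arbitrary choice): there the analytic value is $2 + 30 \cdot 3 + 3 \cdot 2^2 = 104$.

```python
def f(x, y):
    return 2 * x + 30 * x * y + x ** y

x, y, h = 2.0, 3.0, 1e-6
# partial with respect to x: hold y fixed and difference in x
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)
analytic = 2 + 30 * y + y * x ** (y - 1)

print(numeric, analytic)  # both ≈ 104
```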


The gradient of a scalar field $F$ gives the direction of the greatest rate of increase at a particular point in space.

Say we have a function $F = f(x, y, z)$

$$\nabla F = \text{grad}~F = (\frac{\partial f}{\partial x} (x, y, z), \frac{\partial f}{\partial y} (x, y, z), \frac{\partial f}{\partial z} (x, y, z)) $$

Take the gradient of $F = 2xy + e^{xy} + z$
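For this exercise the gradient works out to $\nabla F = (2y + y e^{xy},~ 2x + x e^{xy},~ 1)$; a finite-difference sketch can confirm it at an arbitrary point:

```python
import math

def F(x, y, z):
    return 2 * x * y + math.exp(x * y) + z

x, y, z, h = 0.5, 0.3, 1.2, 1e-6

# analytic gradient components
gx = 2 * y + y * math.exp(x * y)
gy = 2 * x + x * math.exp(x * y)
gz = 1.0

# central finite differences in each coordinate
nx = (F(x + h, y, z) - F(x - h, y, z)) / (2 * h)
ny = (F(x, y + h, z) - F(x, y - h, z)) / (2 * h)
nz = (F(x, y, z + h) - F(x, y, z - h)) / (2 * h)

print(abs(nx - gx) < 1e-5, abs(ny - gy) < 1e-5, abs(nz - gz) < 1e-5)
```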

Linear Algebra

Matrix Product

Different ways of visualizing matrix products.

Say you have

$$ A = \left(\begin{array}{cc} a & b\\ c & d \end{array}\right)$$

$$B = \left(\begin{array}{cc} e & f\\ g & h \end{array}\right) $$

Inner Product Way

Each entry of the product is the inner product of a row of $A$ with a column of $B$: $(AB)_{ij} = \sum_k a_{ik}b_{kj}$.

Outer Product Way

The product is the sum of outer products of the columns of $A$ with the rows of $B$.
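Both views give the same product; here is a numpy sketch with arbitrary $2 \times 2$ matrices, computing the product entry-by-entry via inner products and as a sum of outer products:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

# inner product way: (AB)_ij = row_i(A) · col_j(B)
inner = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(2)])

# outer product way: AB = sum over k of col_k(A) ⊗ row_k(B)
outer = sum(np.outer(A[:, k], B[k]) for k in range(2))

print(np.allclose(inner, A @ B), np.allclose(outer, A @ B))  # True True
```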


Determinant

The determinant can be computed by cofactor expansion along the first row:

$$\text{det}~A = a_{11}C_{11} + a_{12}C_{12} + ... + a_{1n}C_{1n}$$

where the $C_{1j}$ are the cofactors.


A matrix is invertible if its determinant is non-zero.

An invertible matrix has full rank.

Computing the Inverse

$$B = \left(\begin{array}{ccc} 1 & 2 & 3\\ 0 & 4 & 5 \\ 1 & 0 & 6\\ \end{array}\right) $$

Augment matrix method (on board):

Adjoint method

$$A^{-1} = \frac{1}{\text{det} ~A} (\text{cofactor}~ A)^T$$
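Whichever method is used, the result can be checked numerically (a sketch with numpy; `np.linalg.inv` is a general-purpose routine rather than the adjoint formula, but the answer is the same):

```python
import numpy as np

B = np.array([[1., 2., 3.],
              [0., 4., 5.],
              [1., 0., 6.]])

B_inv = np.linalg.inv(B)        # det B = 22, so B is invertible

# an inverse must satisfy B @ B_inv = I
print(np.round(B @ B_inv, 10))  # the 3x3 identity matrix
```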

Computing Eigenvalues and Vectors

$x$ is an eigenvector of matrix $A$ if it satisfies

$$Ax = \lambda x$$

where $\lambda$ is an eigenvalue.

Find eigenvalues by solving

$$\text{det} (A - \lambda I) = 0$$

$$ A = \left(\begin{array}{cc} 0 & 1\\ -2 & -3 \end{array}\right)$$
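For this matrix the characteristic polynomial is $\lambda^2 + 3\lambda + 2 = 0$, giving $\lambda = -1, -2$. A numpy check (not part of the original exercise):

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

eigvals, eigvecs = np.linalg.eig(A)
print(sorted(eigvals))  # ≈ [-2.0, -1.0]

# each eigenpair satisfies A x = λ x
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))  # True
```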

Eigenvector Decomposition

Let $U$ be the eigenvector matrix of $B$

$$U = [v_1, v_2, .., v_n]$$

where $v_1, ..., v_n$ are the eigenvectors; then

$$B = U\Lambda U^{-1}$$

Positive Semi-definite

A symmetric matrix $B$ is PSD if it satisfies, for all vectors $x$,

$$x^TBx \geq 0$$

Show that the covariance matrix is positive semidefinite. The covariance matrix is given by

$$\Sigma = E[(X - E[X])(X - E[X])^T]$$
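One way to see it: for any fixed vector $x$, $x^T \Sigma x = E[((X - E[X])^T x)^2] \geq 0$, since it is the expectation of a square. An empirical sketch with numpy (random data and seed are arbitrary): a sample covariance matrix has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10_000))        # 3 random variables, 10k samples

Sigma = np.cov(X)                        # sample covariance matrix
eigvals = np.linalg.eigvalsh(Sigma)
print(eigvals.min() >= 0)                # True: no negative eigenvalues

x = rng.normal(size=3)
print(x @ Sigma @ x >= 0)                # True for a random x
```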

Derivative with respect to vector


Convex Functions

We are mainly interested in convex functions, as for a convex function a local minimum is the global minimum.

A function is convex if its second derivative is always non-negative

$$\frac{d^2}{dx^2}f(x) \geq 0$$

In higher dimensions you have to show that the Hessian (the matrix of second partial derivatives) is positive semidefinite.

Jensen's Inequality

This inequality can be used to show convexity of a function. It says that for a convex function the secant line lies above the graph: for $t \in [0, 1]$,

$$f(tx_1 + (1 - t)x_2) \leq tf(x_1) + (1 - t)f(x_2)$$

or more generally

$$f(E[x]) \leq E[f(x)]$$

Convex Functions

Is the function

$$f(x) = |x| + e^x$$

convex?


We can use the definition of convexity for $|x|$, with $\alpha, \beta \geq 0$ and $\alpha + \beta = 1$:

$$f(\alpha x + \beta y) = |\alpha x + \beta y| \leq \alpha|x| + \beta|y|$$

The above follows from the triangle inequality, and hence $|x|$ is convex.


For $e^x$ we can use the second-derivative test:

$$\frac{d^2e^x}{dx^2} = e^x \geq 0$$

Since each function is convex, their sum is also convex.
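A numeric spot check of the convexity inequality for $f(x) = |x| + e^x$ (random points and weights; purely illustrative, not a proof):

```python
import random
import math

random.seed(0)

def f(x):
    return abs(x) + math.exp(x)

for _ in range(1_000):
    x1, x2 = random.uniform(-3, 3), random.uniform(-3, 3)
    t = random.random()
    # convexity: f(t x1 + (1-t) x2) <= t f(x1) + (1-t) f(x2)
    assert f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-12

print("inequality holds on all sampled points")
```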

That's all for now