# Content¶

Machine learning is multidisciplinary, and it is useful to be familiar with

- Sets and Types
- Probability
- Linear Algebra
- Optimization



You can get away without knowing much of the math by using the online libraries. But knowing the math will give you an intuition for why things work the way they do.

Note: I like working things out from the ground up, so sorry if it feels too basic at times! I will go quickly over the easy concepts.

## Sets and Types¶

A set is a collection of items or things and we use curly braces to denote set objects.

$$A = \{3, 6, 9\}$$ $$B = \{dog, cat\}$$

The set $A$ consists of $3$ elements, namely $3, 6, 9$.

Sets you may already be familiar with are the natural numbers and the real numbers, $$\mathbb{N}$$ $$\mathbb{R}$$

We can use set builder notation to build sets:

$$\{x~|~x~\in~\mathbb{Z}\}$$

In plain words: $x$ such that $x$ belongs to the integers

## Cardinality¶

$$A = \{3, 6, 9\}$$

The cardinality of a set $|A|$ is the number of elements in that set.

$$|A| = 3$$

The cardinality of $\mathbb{N}$ is (countably) infinite and is denoted $\aleph_0$.

## Some special sets¶

The universal set $\Omega$ is the set that includes all other sets in your space of events.

For every event $A$ $$A \subseteq \Omega$$

The null (empty) set $\phi$ is contained in every other set.

For every event $A$

$$\phi \subseteq A$$

The power set $$2^\Omega$$

consists of all subsets of the universal set $\Omega$.

## Operators¶

Below is a set of operators that are used with sets. Say we have the following sets $A = \{3, 6, 9\}$ $B = \{3, 6, 7\}$, and $C = \{3, 6, 7, 9\}$.

$$\text{Union of two sets: }~ A\cup B = \{3, 6, 7, 9\}$$ $$\text{Intersection of two sets: }~ A\cap B = \{3, 6\}$$ $$\text{Subset, superset: }~ B \subset C$$ $$\text{Complement: } A^c ~\text{is the set of all elements not in}~ A$$

Two sets are equal only if they contain exactly the same elements.

$$A \neq C$$
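These operators map directly onto Python's built-in `set` type; a quick sketch using the same example sets:

```python
# Set operations in Python, using the example sets from above
A = {3, 6, 9}
B = {3, 6, 7}
C = {3, 6, 7, 9}

union = A | B          # {3, 6, 7, 9}
intersection = A & B   # {3, 6}
print(B < C)           # True: B is a proper subset of C
print(A == C)          # False: equality requires the same elements
```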

## Properties of set operations¶

Commutative law: $$A \cup B = B \cup A$$ $$A \cap B = B \cap A$$ Associative law: $$A \cup (B \cup C) = (A \cup B) \cup C$$ Distributive law: $$(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$$ $$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$$

De Morgan's laws: $$(A \cup B)^c = A^c \cap B^c$$ $$(A \cap B)^c = A^c \cup B^c$$

# What is the cardinality of $A \cup B$?¶

In [2]:

```python
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Draw the overlap between sets A and B as a Venn diagram
A = [1, 2, 3, 4]
B = [1, 3, 5, 7]
venn2([set(A), set(B)])
plt.show()
```


$$|A \cup B| = |A| + |B| - |A \cap B|$$
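We can check the inclusion-exclusion formula numerically with the same `A` and `B` used in the Venn diagram (a quick sketch):

```python
# |A ∪ B| = |A| + |B| - |A ∩ B|, checked on the Venn-diagram sets
A = {1, 2, 3, 4}
B = {1, 3, 5, 7}
lhs = len(A | B)
rhs = len(A) + len(B) - len(A & B)
print(lhs, rhs)  # 6 6
```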

# Probability¶

Determinism is a theory stating that all events are determined by previous causes.

This is not enough to describe our world! Consider weather patterns, or physical systems at the atomic level.

To be able to describe certain events we need randomness.

Randomness refers to events that have no deterministic causes and occur in an indeterministic way.

The tool that deals with randomness is probability.

## Measure Theory (skip)¶

Probability theory is a branch of measure theory. Axiomatic probability theory is based on a family of subsets with certain properties. Such a family of subsets is called a $\sigma$-field.

A collection $F = \{A_1, A_2, ...\}$ is a $\sigma$-field (or $\sigma$-algebra) if it satisfies the following.

1. $\phi \in F$
2. if $A_i \in F$ then $A_i^c \in F$
3. if $A_1, A_2, ... \in F$ then $$\bigcup_{i = 1, 2, ...} A_i \in F$$

## $\sigma$-algebra and Measurable Space (skip)¶

Some examples of $\sigma$-algebras are

1. The trivial $\sigma$-field $\{\Omega, \phi\}$
2. Given $\Omega = \{1, 2, 3, 4, 5, 6\}$, an example of a $\sigma$-field is $$\{\Omega, \phi, \{2, 4, 6\}, \{1, 3, 5\}\}$$

The pair $\{\Omega, F\}$ is called a measurable space. Note $F$ is a $\sigma$-algebra.

A real set function $$\mu: F \rightarrow \Re^+$$ is called a measure if $\mu(\phi) = 0$ and, for pairwise disjoint $A_i$, $$\mu(\bigcup_{i = 1, 2,..} A_i) = \sum_{i = 1, 2,...}\mu(A_i)$$

## Measure Space and Probability Space (skip)¶

Say we have a universal set $\Omega$, a $\sigma$-field $F$ of subsets of $\Omega$, and a measure $\mu$ defined on the pair $\{\Omega, F\}$; then the triple

$$\{\Omega, F, \mu\}$$

is a measure space.

If we constrain the measure to have the following property:

$$\mu(\Omega) = 1$$

then it is a probability measure.

### Say we have a die with 6 faces¶

Our universal set is $$\{1, 2, 3, 4, 5, 6\}$$

Say we have the $$\sigma\text{-algebra} = \{\Omega, \phi, \{1, 3, 5\}, \{2, 4, 6\}\}$$

Then we can define a probability measure $\mu$

$$\mu(\Omega) = 1$$

$$\mu(\{1, 3, 5\}) = \frac{1}{3}$$

$$\mu(\{2, 4, 6\}) = \frac{2}{3}$$

Are we missing anything?

The event when none of the faces happen!

$$\mu(\phi) = 0$$

## Interpreting Probability¶

Two common interpretations of probability are the frequentist interpretation and the Bayesian interpretation.

## Frequentist Interpretation¶

Assume a random experiment is performed $N_{x}$ times and an event $B$ occurs $N_{c}$ times.

The frequentist then assigns the probability of event $B$ occurring as

$$P(B) = \lim_{N_x \rightarrow \infty}\frac{N_{c}}{N_{x}}$$

A disadvantage of this is that the experiment needs to be repeatedly performed.

What is the probability of being struck by lightning? You would have to run outside every time there is a thunderstorm and see whether you get struck or not. It's really hard to do in LA!
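A simulation makes the frequentist idea concrete: estimate the probability of heads for a fair coin by its relative frequency (a sketch; the `0.5` threshold encodes the fair coin).

```python
import random

# Frequentist estimate: relative frequency of heads over many tosses
random.seed(0)
N_x = 100_000                                         # number of experiments
N_c = sum(random.random() < 0.5 for _ in range(N_x))  # number of heads
p_hat = N_c / N_x
print(p_hat)  # close to 0.5
```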

## Bayesian Interpretation¶

In Bayesian probability, instead of using the frequency of an event as a measure, we use a reasonable expectation for that event to occur given information about the context.

So in Bayesian inference a hypothesis is given a certain probability based on prior knowledge (prior probability) and updated when new evidence/knowledge is obtained (posterior probability).

## Axiomatic Definition of Probability¶

1. $P(A) \geq 0$
2. For mutually exclusive events $A$ and $B$ we have $P(A \cup B) = P(A) + P(B)$
3. $P(\Omega) = 1$

Some properties:

1. $P(\phi) = 0$
2. $A \subset B \subset \Omega \rightarrow P(A) \leq P(B)$
3. $A \subseteq \Omega \rightarrow P(A) \leq 1$
4. $P(A^c) = 1 - P(A)$
5. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

## Let A and B be events with probabilities $P(A) = \frac{3}{4}$ and $P(B) = \frac{1}{3}$ show¶

$$\frac{1}{12} \leq P(A \cap B) \leq \frac{1}{3}$$

$$P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq \frac{3}{4} + \frac{1}{3} - 1 = \frac{1}{12} \quad \text{since}~ P(A \cup B) \leq 1$$

$$A \cap B \subseteq A,~ A \cap B \subseteq B \rightarrow P(A \cap B) \leq \min(P(A), P(B)) = \frac{1}{3}$$

## Conditional Probability¶

The probability of an event given that another event has occurred. It is defined by

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

The way to think of this: say we have $$\Omega = \{1, 2, 3, 4, 5, 6\}$$ and we want to know the probability that the event $A = \{3\}$ occurred given that the event odd, $\{1, 3, 5\}$, happened.

We know that the event odd happened so now we operate in the sample set $\{1, 3, 5\}$. If we assume that each number is equally likely then,

$$P(A|~\text{odd}) = \frac{1}{3}$$

Using the formula: $P(A|~\text{odd}) = \frac{P(A \cap \text{odd})}{P(\text{odd})} = \frac{\frac{1}{6}}{\frac{1}{2}} = \frac{1}{3}$
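The same conditional probability can be estimated by simulation: keep only the odd rolls and count how often a 3 appears among them (a sketch).

```python
import random

# Estimate P(roll = 3 | roll is odd) for a fair die
random.seed(1)
rolls = [random.randint(1, 6) for _ in range(60_000)]
odd = [r for r in rolls if r % 2 == 1]   # condition on the event "odd"
p_cond = odd.count(3) / len(odd)
print(p_cond)  # close to 1/3
```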

## Independence and Conditional Independence¶

Two events are independent if the occurrence of one does not affect the other.

$$P(A \cap B) = P(A)\cdot P(B)$$

Like two fair coin tosses, the probability of getting two heads is

$$P(H) \cdot P(H) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

Two events $A$ and $B$ are conditionally independent given an event $C$ if

$$P(A \cap B ~|~ C) = P(A|C) \cdot P(B|C)$$

## Random Variables¶

Random variables are functions that map outcomes to real valued numbers.

$$X: \Omega \rightarrow \Re$$

For example, take the previous example of rolling a die with the possible outcomes of an odd number ($\{1, 3, 5\}$) or an even number ($\{2, 4, 6\}$).

Then a random variable can be

$$X(\{1, 3, 5\}) = 1$$ $$X(\{2, 4, 6\}) = 20$$

If we assume that each face has equal probability, and the probability of the die landing on none of the faces is $0$, then

$$P(X = 1) = \frac{1}{2}$$

$$P(X = 20) = \frac{1}{2}$$

# Example¶

Say you are running a coin experiment and you toss the coin $N$ times (where $N$ is a finite value). You want to count the number of heads.

Then the probability of k heads happening is given by

$$P(X = k) := P(\{\omega : X(\omega) = k\})$$

where $\{\omega : X(\omega) = k\}$ is the set of sequences that contain $k$ heads. We don't care about the order in which the heads occur, just the count.
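Since order doesn't matter, there are $\binom{N}{k}$ sequences with exactly $k$ heads, each with probability $p^k(1-p)^{N-k}$; a sketch of that counting argument:

```python
from math import comb

# P(X = k): count the sequences with k heads, weight each by its probability
def prob_k_heads(N, k, p=0.5):
    return comb(N, k) * p**k * (1 - p)**(N - k)

print(prob_k_heads(4, 2))  # 6/16 = 0.375
```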

## Distributions¶

Every random variable has a distribution. When we write

$$x \sim p(x)$$

We mean that $x$ was sampled from a probability distribution $p(x)$.

This refers to the probability measure of a particular outcome (random variable maps events to real valued numbers!) in your universe.

For a fair die the discrete probability function is given by a uniform distribution $$P(X) = \begin{cases} 0 & \text{for}~ X \notin \Omega \\ \frac{1}{|\Omega|} & \text{for}~ X \in \Omega \end{cases}$$

so we have $P(X = a) = \frac{1}{6}$.

## Joint and Marginal Distributions¶

Let an experiment consist of $2$ dice throws, one die orange and one green (we treat them as two different dice). Then the set of all possible outcomes consists of all pairs $\{(1, 1), (1, 2), ...\}$.

If we assume fair dice, the probability mass function (PMF) is $$P(X, Y) = \frac{1}{36}$$

Say we didn't care what the orange die rolled; then the marginal is given by

$$P(X) = \sum_{y \in Y}{P(X, y)} = \frac{1}{6}$$

Note: this is a discrete probability distribution so we use summation; for the continuous case the distribution is called a probability density function (PDF) and we integrate.
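Marginalization is just a sum over the variable you don't care about; a sketch using exact fractions for the two-dice joint PMF:

```python
from fractions import Fraction

# Joint PMF of two fair dice: P(X = x, Y = y) = 1/36 for every pair
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

# Marginal P(X = x): sum the joint over all values of y
marginal_x = {x: sum(joint[(x, y)] for y in range(1, 7)) for x in range(1, 7)}
print(marginal_x[1])  # 1/6
```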

## Expectation¶

Take a discrete random variable $X$ with values $x = a_1, a_2, ...$ and a probability distribution $P(X)$.

The expected value of $X$ is defined as

$$E[X] = \sum_i P(X = a_i)\cdot a_i$$

For a fair die with a uniform distribution (dice again..!)

$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

For a continuous variable

$$E[X] = \int x\cdot p(x)dx$$
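The discrete formula is a short sum; a sketch computing the fair-die expectation exactly:

```python
from fractions import Fraction

# E[X] = sum over faces of P(X = a) * a for a fair die
E = sum(Fraction(1, 6) * a for a in range(1, 7))
print(E)         # 7/2
print(float(E))  # 3.5
```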

## Mean and Variance¶

The mean of a random variable is defined as its expected value

$$\mu = E[X]$$

and the variance $Var(X)$ is the expected value of the random variable $Y = (X - \mu)^2$

$$\sigma^2 = E[(X - \mu)^2]$$

Some properties of Expected value:

1. $E[aX] = aE[X]$
2. $E[X + Y] = E[X] + E[Y]$
3. $E[XY] = E[X]E[Y]$ if $X$ and $Y$ are independent, i.e. $P(X, Y) = P(X)P(Y)$ (the converse does not hold in general)

## What is the expected value of a random variable $X$ with $x \in \{1, 10\}$ and $X \sim \text{Bernoulli}(p = 0.5)$?¶

$$E[X] = 1 \cdot 0.5 + 10 \cdot 0.5 = 5.5$$

## Covariance and Correlation¶

The covariance of a joint distribution $p(x, y)$ measures the tendency of $x$ and $y$ to deviate from their means together.

$$cov(x, y) = E[(x-\mu_x)(y-\mu_y)]$$

Correlation is a standardized covariance, hence its range is limited to $[-1, 1]$

$$corr(x, y) = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y}$$
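A sketch with NumPy's `np.cov` and `np.corrcoef` (assuming NumPy is available): for `y` close to `2x`, the sample covariance is near 2 and the correlation near 1.

```python
import numpy as np

# Sample covariance and correlation of two strongly related variables
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(scale=0.1, size=10_000)  # y is almost linear in x

cov = np.cov(x, y)[0, 1]        # off-diagonal of the 2x2 covariance matrix
corr = np.corrcoef(x, y)[0, 1]  # standardized version, bounded by [-1, 1]
print(cov, corr)
```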

## Sum of Random Variables¶

If we have a sequence of random variables $X_1, X_2, ..., X_n$ and Y is their sum $$Y = \sum_i^nX_i$$

Find its mean and variance if they are statistically independent.

$$E[Y] = E[X_1] + E[X_2] + ... + E[X_n]$$

$$Var(Y) = E[\sum_i^n(X_i - E[X_i])\sum_j^n(X_j - E[X_j])]$$ $$=\sum_i^n Var(X_i) + \sum_{i \neq j} Cov(X_i, X_j) = \sum_i^n Var(X_i)$$

since the covariance terms vanish when the $X_i$ are independent.

## Sample Mean¶

If a set of experiments is performed repeatedly, each with random variable $X_i$, and they are independent and identically distributed (iid), the sample mean is given by

$$M_n = \frac{1}{n}\sum_i^nX_i$$

The above is an estimator for the true mean of $X$.

We say it is an unbiased estimator if $$E[M_n] = \mu$$ the true mean of $X$.

## Show that $M_n = \frac{1}{n}\sum_i^nX_i$ is an unbiased estimator¶

$$E[M_n] = E[\frac{1}{n}\sum_i^nX_i] = \frac{1}{n}E[\sum_i^nX_i] = \frac{1}{n}\cdot n \cdot E[X_i] = \mu$$

## Law of Large Numbers¶

It states that after a large enough number of experiments the sample mean will approach the true mean.

$$\lim_{n\rightarrow \infty} P[|M_n - \mu| < \epsilon] = 1$$

The strong law of large numbers makes the stronger statement that the sample mean converges to $\mu$ with probability $1$:

$$P[\lim_{n\rightarrow \infty} M_n = \mu] = 1$$
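A quick simulation of the law of large numbers for die rolls (a sketch): the sample mean drifts toward the true mean 3.5 as $n$ grows.

```python
import random

# Sample mean of n fair-die rolls; should approach the true mean 3.5
random.seed(2)
def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```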

## Central Limit Theorem¶

If we have a sequence of iid random variables $X_1, X_2, ...$ with finite mean $\mu$ and finite variance $\sigma^2$, let the sum of the first $n$ random variables in the sequence be

$$S_n = X_1 + X_2 + ... + X_n$$

A new random variable with zero mean and unit variance is defined as

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

Then the central limit theorem states that as $n$ approaches $\infty$, the CDF of $Z_n$ approaches the standard normal CDF

$$F_{Z_n}(z) \rightarrow \frac{1}{\sqrt{2\pi}}\int_{-\infty}^z e^{-\frac{y^2}{2}}dy$$
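A sketch of the CLT in action: standardize sums of die rolls ($\mu = 3.5$, $\sigma^2 = 35/12$) and check that $Z_n$ has mean near 0 and standard deviation near 1.

```python
import random
import statistics

# Z_n = (S_n - n*mu) / (sigma * sqrt(n)) for sums of n fair-die rolls
random.seed(3)
mu, sigma = 3.5, (35 / 12) ** 0.5
n = 1_000
z = [(sum(random.randint(1, 6) for _ in range(n)) - n * mu) / (sigma * n**0.5)
     for _ in range(2_000)]
print(statistics.mean(z), statistics.stdev(z))  # near 0 and 1
```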

# Calculus¶

## Derivative¶

Refer to a standard table of derivatives for the rules you should be familiar with.

Once you do that take the derivative of

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
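You can sanity-check the identity with a central finite difference (a sketch; the point `x = 0.7` and step `h` are arbitrary):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Compare the analytic derivative sigma(x)(1 - sigma(x)) to a finite difference
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)
```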

## Chain Rule¶

$$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$

Say $f(x) = x^2$, and take the derivative of $$\sigma(f(x))$$

$$\frac{d}{dx}\sigma(f(x)) = \sigma(f(x))(1 - \sigma(f(x))) \cdot 2x$$

## One more example¶

Let $z = (1 + e^x) \cdot 2x$ and $y = z^2$; take the derivative of

$$F(y) = e^y$$ with respect to x.

# Partial Derivative¶

Say we have

$$f(x, y) = 2x + 30xy + x^y$$

Take the partial of $f$ with respect to $x$

$$\frac{\partial f}{\partial x} = 2 + 30y + y x^{y - 1}$$

When you take a partial derivative you treat every variable as constant besides the one you are differentiating with respect to.

The gradient of a scalar field $F$ gives the direction of the greatest rate of increase at a particular point in space.

Say we have a function $F = f(x, y, z)$

$$\nabla F = \text{grad}~F = \left(\frac{\partial f}{\partial x}(x, y, z),~ \frac{\partial f}{\partial y}(x, y, z),~ \frac{\partial f}{\partial z}(x, y, z)\right)$$

Take the gradient of $F = 2xy + e^{xy} + z$
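As a check on the exercise, here is one worked answer, $\nabla F = (2y + y e^{xy},~ 2x + x e^{xy},~ 1)$, verified against finite differences at an arbitrary point (a sketch):

```python
import math

def F(x, y, z):
    return 2 * x * y + math.exp(x * y) + z

def grad_F(x, y, z):
    # Partial derivatives computed by hand from F
    return (2 * y + y * math.exp(x * y),
            2 * x + x * math.exp(x * y),
            1.0)

# Central finite differences at an arbitrary point p
p, h = (0.5, -0.3, 2.0), 1e-6
numeric = tuple(
    (F(*[v + h * (i == j) for j, v in enumerate(p)])
     - F(*[v - h * (i == j) for j, v in enumerate(p)])) / (2 * h)
    for i in range(3))
print(numeric)
print(grad_F(*p))
```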

# Linear Algebra¶

## Matrix Product¶

There are different ways of visualizing matrix products.

Say you have

$$A = \left(\begin{array}{cc} a & b\\ c & d \end{array}\right)$$

$$B = \left(\begin{array}{cc} e & f\\ g & h \end{array}\right)$$

## Determinant¶

$$\det A = a_{11}C_{11} + a_{12}C_{12} + ... + a_{1n}C_{1n}$$

where the $C_{1j}$ are the cofactors.

## Invertibility¶

A matrix is invertible if its determinant is non-zero.

An invertible matrix has full rank.

## Computing the Inverse¶

$$B = \left(\begin{array}{ccc} 1 & 2 & 3\\ 0 & 4 & 5 \\ 1 & 0 & 6\\ \end{array}\right)$$

Augmented matrix method (on board), or via the cofactor (adjugate) formula:

$$A^{-1} = \frac{1}{\text{det} ~A} (\text{cofactor}~ A)^T$$
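Numerically, NumPy's `np.linalg.inv` does the same job; a sketch inverting the $B$ above and confirming $BB^{-1} = I$:

```python
import numpy as np

B = np.array([[1, 2, 3],
              [0, 4, 5],
              [1, 0, 6]])
B_inv = np.linalg.inv(B)        # raises LinAlgError if det B = 0
print(np.round(B @ B_inv, 10))  # the 3x3 identity, up to rounding
```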

## Computing Eigenvalues and Vectors¶

$x$ is an eigenvector of matrix $A$ if it satisfies

$$Ax = \lambda x$$

where $\lambda$ is an eigenvalue.

Find the eigenvalues by solving

$$\det (A - \lambda I) = 0$$

$$A = \left(\begin{array}{cc} 0 & 1\\ -2 & -3 \end{array}\right)$$

## Eigenvector Decomposition¶

Let $U$ be the eigenvector matrix of $B$

$$U = [v_1, v_2, .., v_n]$$

where $v_1, ..., v_n$ are the eigenvectors (as columns); then

$$B = U \Lambda U^{-1}$$

where $\Lambda$ is the diagonal matrix of eigenvalues.

# Find the eigenvector decomposition¶

$$A = \left(\begin{array}{cc} 2 & 3\\ 2 & 1 \end{array}\right)$$

$$\lambda_1 = 4$$

with eigenvector

$$u_1 = \left(\begin{array}{c} 3 \\ 2 \end{array}\right)$$

and

$$\lambda_2 = -1$$

with eigenvector

$$u_2 = \left(\begin{array}{c} -1 \\ 1 \end{array}\right)$$
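A sketch verifying these eigenpairs with `np.linalg.eig` and reconstructing $A = U \Lambda U^{-1}$ (NumPy may return the eigenvalues in a different order and scale the eigenvectors to unit length):

```python
import numpy as np

A = np.array([[2, 3],
              [2, 1]])
eigvals, U = np.linalg.eig(A)  # columns of U are (normalized) eigenvectors
Lam = np.diag(eigvals)
print(eigvals)                 # 4 and -1, in some order
print(np.round(U @ Lam @ np.linalg.inv(U), 10))  # recovers A
```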

## Positive Semi-definite¶

A matrix $B$ is positive semi-definite (PSD) if, for all vectors $x$, it satisfies

$$x^TBx \geq 0$$

Show that the covariance matrix is positive semidefinite. The covariance matrix is given by

$$\Sigma = E[(X - E[X])(X - E[X])^T]$$

# Optimization¶

## Convex Functions¶

We are mainly interested in convex functions, as the local minimum is the global minimum.

A function is convex if its second derivative is always non-negative

$$\frac{d^2}{dx^2}f(x) \geq 0$$

In higher dimensions you have to show that the Hessian (the matrix of second partial derivatives) is positive semi-definite.

## Jensen's inequality¶

This inequality can be used to show the convexity of a function. It says that for convex functions a secant line lies above the graph: for $t \in [0, 1]$,

$$f(tx_1 + (1 - t)x_2) \leq tf(x_1) + (1 - t)f(x_2)$$

or more generally

$$f(E[x]) \leq E[f(x)]$$

## Convex Functions¶

Is the function

$$f(x) = |x| + e^x$$

convex?

We can use the definition of convexity for $|x|$: for $\alpha, \beta \geq 0$ with $\alpha + \beta = 1$,

$$f(\alpha x + \beta y) = |\alpha x + \beta y| \leq \alpha|x| + \beta|y|$$

The above follows from the triangle inequality, hence $|x|$ is convex.

Further,

$$\frac{d^2e^x}{dx^2} = e^x \geq 0$$

Since both functions are convex, their sum is also convex.
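A numerical spot-check of the conclusion via Jensen's inequality (a sketch over a grid of $t$ values and one fixed, arbitrary pair of points):

```python
import math

def f(x):
    return abs(x) + math.exp(x)

# Convexity check: f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) for t in [0, 1]
x1, x2 = -2.0, 1.5
ok = all(
    f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-12
    for t in [i / 100 for i in range(101)])
print(ok)  # True
```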