Machine learning is multidesciplinary and it is useful to be familiar with

```
- Sets and Types
- Probabillity
- Linear Algebra
- Optimization
```

You can get away without knowing much of the math by using the online libraries. But knowing the math will help you provide you an intuition on why things work the way they do.

**Note: I like working things from the ground up, so sorry if you think it's too basic at times! I will go fast over the easy concepts**

A set is a collection of items or things and we use curly braces to denote set objects.

$$A = \{3, 6, 9\}$$ $$B = \{dog, cat\}$$

The set $A$ consists of 3 elements namely $3, 6, 9$.

Sets you could be familiar with are natural numbers and real numbers, $$\mathbb{N}$$ $$\mathbb{R}$$

We can use set builder notation to build sets:

$$\{x~|~x~\in~\mathbb{Z}\}$$

In plain words: $x$ such that $x$ belongs to the integers

$$A = \{3, 6, 9\}$$

The cardinality of a set $|A|$ is the number of elements in that set.

$$|A|~ \text{is}~3$$

The cardinality of $\mathbb{N}$ is infinite and is called $\mathbb{N}_0$

The universal set $\Omega$ is that set that includes all other sets in your space of events.

For every event $A$ $$A \subseteq \Omega$$

The null set $\phi$ is contained in every other set.

For every event $A$

$$\phi \subseteq A$$

The set $$2^\Omega$$

Consists all subsets in the universal set $\Omega$

Below is a set of operators that are used with sets. Say we have the following sets $A = \{3, 6, 9\}$ $B = \{3, 6, 7\}$, and $C = \{3, 6, 7, 9\}$.

$$ \text{Union of two sets: }~ A\cup B = \{3, 6, 7, 9\}$$ $$ \text{Intersection of two sets: }~ A\cap B = \{3, 6\}$$ $$ \text{Subset, superset: }~ B \subset C$$ $$ \text{Compliment: } A^c \text{is the set of all elements not in A} $$

Two sets are equal only if they have the same elements in both sets.

$$A \neq C$$

Commutative law: $$A \cup B = B \cup A$$ $$A \cap B = B \cap A$$ Associative law: $$A \cup (B \cup C) = (A \cup B) \cup C$$ Distributive law: $$(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$$ $$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$$

De Morgan Laws: $$(A \cup B) = A^c \cap B^c$$ $$(A \cap B)^c = A^c \cup B^c$$

In [2]:

```
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
A = [1, 2, 3, 4]
B = [1, 3, 5, 7]
venn2([set(A), set(B)])
plt.show()
```

$$|A \cup B| = |A| + |B| - |A \cap B|$$

**Determinism** - a theory stating that all events are determined by previouse causes.

This is not enough to describe the our world! For example weather patterns, or physical systems at atomic level.

To be able to describe certain events we need **randomness**.

Randomness are events that have no deterministic causes but occur in an indeterministic way.

The tool that deals with randomness is probability.

Probability theory is a branch of measure theory. The axiomatic probability theory is based on a family of subsets with certain properties. These family of subsets are called $\sigma$-field.

A collection $F = \{A_1, A_2, ...\}$ is a $\sigma$-field (or algebra) if it satisfies the following.

- $\phi \in F$
- if $A_i \in F$ then $A_i^c \in F$
- if $A_i \in F$ then $$\bigcup_{1, 2,..} A_i \in F$$

Some examples of $\sigma$-algebra are

- The trivial $\sigma$ -field $\{\Omega, \phi\}$
- Given $\Omega = \{1, 2, 3, 4, 5, 6\}$ an example of a sigma field is $$\{\Omega, \phi, \{2, 4, 5\}\, \{1, 3, 5\}\}$$

The pair $\{\Omega, F\}$ is called a measurable space. Note $F$ is a $\sigma$-algebra.

Say we have a universal set $\Omega$, $F$ a $\mu$-field subets of $\Omega$, and a measure $\mu$, defined on the pair $\{\Omega, F\}$ then the triple

$$\{\Omega, F, \mu\}$$

is a measure space.

If we constrain the measure to have the following property:

$$\mu(\Omega) = 1$$

then it is a probability measure.

Our univeral set is $$\{1, 2, 3, 4, 5, 6\}$$

Say we have a $$\sigma\text{-alebra} = \{\Omega, \phi, \{1, 3, 5\}, \{2, 4, 6\}\}$$

Then we can define a probability measure $\mu$

$$\mu(\Omega) = 1$$

$$\mu(\{1, 3, 5\}) = \frac{1}{3}$$

$$\mu(\{2, 4, 6\}) = \frac{2}{3}$$

Are we missing anything?

The event when none of the faces happen!

$$\mu(\phi) = 0$$

Two common intpretations for probability are the **Frequintist intrepreation** and **Baysian intrepretation.**

Assume a random experiment is performed $N_{x}$ times and an event $B$ occurs $N_{c}$ times.

The frequintist then assign the probability of event $B$ occuring.

$$P(B) = \lim_{N_x \rightarrow \infty}\frac{N_{x}}{N_{c}}$$

Disadvantage of this is that the experiment needs to be repeatidly performed.

What is the probability of being struck by a lightning? You have to run outside everytime there is a thunderstorm and see whether you get struck by lightning or not. It's really hard to do in LA!

In baysian probability instead of using frequency of an event as a measure we instead use a reasonable expectation for that event to occur given information about the context.

- $P(A) \geq 0$
- For each mutually exclusive event we have $P(A \cup B) = P(A) + P(B)$
- $P(\Omega) = 1$

Some properties:

- $P(\phi) = 0$
- $A \subset B \subset \Omega \rightarrow P(A) \leq P(B)$
- $A \subseteq \Omega \rightarrow P(A) \leq 1$
- $P(A^c) = 1 - P(A)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$$\frac{1}{12} \leq P(A \cap B) \leq \frac{1}{3}$$

$$P(A \cap B) = P(A) + P(B) - P(A \cup B) \leq \frac{1}{3} + \frac{3}{4} - 1$$

$$ A \cap B \subseteq A, A \cap B \subseteq B \rightarrow P(A \cap B) \leq min([P(A), P(B)]) = \frac{1}{3}$$

The probability of an event given that another event has occured. It can be calculated using bayes formula

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

The way to think of this is say we have $$\Omega = \{1, 2, 3, 4, 5, 6\}$$ and we want to know the probability that an event $A = \{3\}$ occured given that the event odd happened $\{1, 3, 5\}$.

We know that the event odd happened so now we operate in the sample set $\{1, 3, 5\}$. If we assume that each number is equally likely then,

$$P(A|~\text{odd}) = \frac{1}{3}$$

Using bayes rule $P(A|~\text{odd}) = \frac{\frac{1}{6}}{\frac{1}{2}}$

Two events are independent if the occurence of one does not effect the other.

$$P(A \cap B) = P(A)\cdot P(B)$$

Like two fair coin tosses, the probability of getting two heads is

$$P(H) \cdot P(H) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

Two events are Conditionally independent if

Random variables are functions that map outcomes to real valued numbers.

$$X: \Omega \rightarrow \Re$$

For example, take the previous example of rolling a dice with the possible outcomes of having a odd number ($\{1, 3, 5\}$), or even number ($\{2, 4, 6\}$).

Then the a random variable can be

$$X(\{1, 3, 5\}) = 1$$ $$X(\{2, 4, 6\}) = 20$$

If we assume the each face has equal probability, and the probability of the die landing on none of the faces to be $0$. Then

$$P(x = 1) = \frac{1}{2}$$

$$P(x = 20) = \frac{1}{2}$$

Say you are running a coin experiment and you're tossing the coin N (where N is a finite value) times. You want to count the number of heads.

Then the probability of k heads happening is given by

$$P(X = k) := P(\{\omega : X(\omega) = k\})$$

We define a distribution of a random variable. When we write

$$x \sim p(x)$$

We mean that $x$ was sampled from a probability distribution $p(x)$.

This refers to the probability measure of a particular outcome (random variable maps events to real valued numbers!) in your universe.

For a fair dice the discrete probability function is given by a uniform distribution $$P(X) = \begin{cases} 0 ~ \text{for X}\notin\Omega \\ \frac{1}{|\Omega|} \text{for X}\in\Omega \end{cases}$$

so we have $P(X = a) = \frac{1}{6}$.

Let an experiment consist of $2$ dice throws, one die is orange and another is green (we are treating them as two different dies). Then the set of all possible outcomes consist of all pairs $\{(1, 1), (1, 2), ...\}$.

If we assume uniform dies, the **probability mass function** (PMF) is $$P(X, Y) = \frac{1}{36}$$

say we didn't care what the orange die roll was then the marginal is given by

$$P(X) = \sum_{y \in Y}{P(X, y)} = \frac{1}{6}$$

**note: this is a discrete probability distribution so we use summation, for continious the distribution is called a probability density function PDF**

Discrete frandom variable $X$ with values $x = a_1, a_2, ...$ with a probability distribution $P(X)$

The expected value of $X$ is defined as

$$E[X] = \sum_i{P(X = a_i})\cdot a_i$$

For a fair dice with uniform distribution (again dice..!)

$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

For continious variable

$$E[X] = \int x\cdot p(x)dx$$

The mean of a random variable is defined as it's expected value

$$\mu = E[X]$$

and the variance $Var(X)$ is the expected value of the random variable $ Y = (X - \mu)^2$

$$\sigma^2 = E[(X - \mu)^2]$$

Some properties of Expected value:

- $E[aX] = aE[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = E[X]E[Y]$ if and only if X and Y are independent i.e. P(X, Y) = P(X)P(Y)

$$E[X] = 1 \cdot 0.5 + 10 \cdot 0.5 = 5.5$$

Covariance of a join distribution $p(x, y)$ measures the tendency for x and y to deviate from their means.

$$cov(x, y) = E[(x-\mu_x)(y-\mu_y)]$$

Correlation is a standardized covariance hence it's range will be limited to $[-1, 1]$

$$corr(x, y) = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y}$$

If we have a sequence of random variables $X_1, X_2, ..., X_n$ and Y is their sum $$Y = \sum_i^nX_i$$

Find it's mean and variance if they are statistically independent.

$$E[Y] = E[X_1] + E[X_2] + ... + E[X_n]$$

$$Var(Y) = E[\sum_i^n(X_i - E[X_i])\sum_j^n(X_j - E[X_j])]$$ $$=\sum_i^nVar(X_k) + \sum_i^n\sum_j^nCOV(X_i, X_j)$$

if a set of experiements are performed repeatedly each with random variable $X_i$ and they are independent identically distributed (iid) the sample mean is given by

$$M_n = \frac{1}{n}\sum_i^nX_i$$

The above is an estimator for the true mean of $X$.

We say it is an **unbiased** estimator if $$E[M_n] = \mu$$ the true mean of $X$.

States that after large enough number of experiments the true sample mean will approach the true mean.

$$\lim_{n\rightarrow \infty} P[|M_n - \mu| < \epsilon] = 1$$

For the strong law of large numbers instead of using a tight bound it is replaced with equality

$$\lim_{n\rightarrow \infty} P[M_n = \mu] = 1$$

If we have a sequence of iid random varialbes $X_1, X_2, ...$ with finite mean $\mu$ and finite variance $\sigma^2$ and let the sum of the first n random variables in the sequence

$$S_n = X_1 + X_2 + ... + X_n$$

A new random variable with zero mean and finite variance $\sigma^2$ is defined as

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

Then the central limit theorem states as n approaches $\infty$

$$F_{Z_n}(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^z e^{\frac{-y}{2}}dy$$

$$\sigma(x)' = \sigma(x)(1 - \sigma(x))$$

$$f(g(x)) = f'(g(x)) \cdot g(x)$$

Say $f(x) = x^2$, and take the derivative of $$\sigma(f(x))$$

$$\sigma(f(x)) = \sigma(f(x))(1 - \sigma(f(x))) \cdot 2x$$

$z = (1 + e^x) \cdot 2x$ and $y = z^2$ take the derivative of

$$F(y) = e^y$$ with respect to x.

Say we have

$$f(x, y) = 2x + 30xy + x^y$$

Take the partial of $f$ with respect to $x$

$$\frac{\partial f}{\partial x} = 2 + 30y + (y - 1) x ^ y$$

When you take partial derivative you treat everything else as constants besides the variable you are taking the derivative with.

Gradient vector of a field vector $F$ will give the direction of greatest rate of increase at a particular point in space.

Say we have a function $F = f(x, y, z)$

$$\triangledown F = \text{grad}~F = (\frac{\partial f}{\partial x} (x, y, z), \frac{\partial f}{\partial y} (x, y, z), \frac{\partial f}{\partial z} (x, y, z)) $$

Take the gradient of $F = 2xy + e^{xy} + z$

Different waves of visualizing matrix products

Say you have

$$ A = \left(\begin{array}{cc} a & b\\ c & d \end{array}\right)$$

$$B = \left(\begin{array}{cc} e & f\\ g & h \end{array}\right) $$

$$detA = a_{11}C_{11} + a_{12}C_{12} + ... + a_{1n}C_{1n}$$

where $C_{xx}$ are the cofactor matrices.

A matrix is invertible if it's determinant is non-zero.

An invertible matrix has full rank.

$$B = \left(\begin{array}{cc} 1 & 2 & 3\\ 0 & 4 & 5 \\ 1 & 0 & 6\\ \end{array}\right) $$

Augment matrix method (on board):

Adjoint method

$$A^{-1} = \frac{1}{\text{det} ~A} (\text{cofactor}~ A)^T$$

$x$ is a eigenvector of matrix $A$ if it satisfies

$$Ax = \lambda x$$

where $\lambda$ is an eigenvalue.

Find eigen values using

$$\text{det} ~|A - \lambda I|$$

$$ A = \left(\begin{array}{cc} 0 & 1\\ -2 & -3 \end{array}\right)$$

Let $U$ be the eigenvector matrix of $B$

$$U = [v_1, v_2, .., v_n]$$

where $v_1, ..., v_n$ are eigenvectors then

$$B = U^{-1}\Lambda U$$

$$ A = \left(\begin{array}{cc} 2 & 3\\ 2 & 1 \end{array}\right)$$

$$\lambda_1 = 4$$

with eigenvector

$$ u_1 = \left(\begin{array}{cc} 3 \\ 2 \end{array}\right)$$

and

$$\lambda_2 = -1$$

with eigenvector

$$ u_1 = \left(\begin{array}{cc} -1 \\ 1 \end{array}\right)$$

Show that the covariance matrix is positive semidefinite. The covariance matrix is given by

$$\sum = E[(X - E[X])(X - E[X])^T]$$

We are mainly interested in convex functions, as the local minimum is the global minimum.

A function is convex if it's second derivative is always positive

$$\frac{d}{dx^2}f(x) \geq 0$$

In higher dimensions you have to show that the second derivative is positive semi definite.

This inequality can be used to show convexity of a function. It says that for convex functions a secant line lies above the actual graph.

$$f(tx_1 + (1 - t)x_2) \leq tf(x_1) + (1 - t)f(x_2)$$

or more generally

$$f(E[x]) \leq E[f(x)]$$

We can use jensens inequality for $|x|$:

$$f(\alpha x + \beta y) = |\alpha x + \beta y| \leq \alpha|x| + \beta|y|$$

The above follows from triangle inequality and hence using jenses theorem is convex.

Further,

$$\frac{d^2e^x}{dx^2} = e^x \geq 0$$

Since either function is convex, their sum must also be convex.