$1$. This model assumes the hypothesis function takes the form

$$h(x) = \sum_{i=1}^n \theta_ix_i = \theta^Tx$$

$2$. Cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2 \\ = \frac{1}{2} (h_\theta - Y)^T(h_\theta - Y)$$
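
As a sanity check, the two forms of the cost can be evaluated on toy data (the numbers below are made up for illustration):

```python
import numpy as np

# Toy data: m = 4 samples, n = 2 features (first column is a bias term).
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta = np.array([1.0, 2.0])  # candidate parameters

# Summation form: J = 1/2 * sum_i (theta^T x_i - y_i)^2
J_sum = 0.5 * sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

# Vector form: J = 1/2 * (X theta - Y)^T (X theta - Y)
r = X @ theta - y
J_vec = 0.5 * r @ r

print(J_sum, J_vec)  # 8.0 8.0 — the two forms agree
```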

$\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial x} = y^T A$

$\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial y} = x^T A^T$

$\alpha = x^TAx \rightarrow \frac{\partial \alpha}{\partial x} = x^T(A + A^T)$
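
These identities can be verified numerically with a central finite-difference gradient (a quick sketch; the matrix and vectors are random):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
y = rng.standard_normal(3)
eps = 1e-6

def num_grad(f, v):
    """Central finite-difference gradient of scalar f at v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

# d/dx (y^T A x) = y^T A
g1 = num_grad(lambda x_: y @ A @ x_, x)
assert np.allclose(g1, y @ A, atol=1e-4)

# d/dx (x^T A x) = x^T (A + A^T)
g2 = num_grad(lambda x_: x_ @ A @ x_, x)
assert np.allclose(g2, x @ (A + A.T), atol=1e-4)

print("identities verified")
```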


**Supervised learning**: given a set of features $x_i$ and labels $y_i$, train a model to predict the label $y_{new}$ for an unseen feature $x_{new}$.

$$F_{model}: x_i \rightarrow y_i$$

Example: a dataset containing breast tissue images labelled as either cancerous or non-cancerous.

**Unsupervised learning**: given a set of features $x_i$, find patterns in the data and assign each point to a class $y_c$.
$$F_{model}: x_i \rightarrow y_c$$

Example: cluster a set of patients according to their weight, height, etc.

**Regression**: here the goal is to estimate a real-valued variable $y \in \Re$ given an input vector $x_i$.

Example: predict the weight of a person given height, gender, race, etc.

**Classification**: given an input feature $x_i$, categorize it and assign it a label $y \in Y := \{y_1, \ldots, y_k\}$.

Example: classify a document as written in $\{\text{English}, \text{French}, \text{Spanish}, \ldots\}$.

**Goal**: to learn a function $f: x \rightarrow y$ to make predictions.

- Define a learning process.
- supervised learning: define a loss function $l(x_i, y_i)$ which incurs some loss on bad predictions
- unsupervised learning: learn a distribution of the data or an underlying structure.

- Train the model using the training data set.
- Make predictions using the trained model.
- Hope it generalizes well...

**Maximum likelihood estimation (MLE)** is a method of estimating model parameters by maximizing the likelihood function.

The likelihood of observing our dataset given the model parameters is given by

$$ L(\theta; D) = P(D|\theta) = P((x_1, y_1), (x_2, y_2), ... (x_n, y_n)| \theta) = \prod_{k = 1}^np((x_k, y_k)|\theta)$$

**Goal**: Determine the parameters $\theta$ of our model.

We can achieve this by finding the $\theta$ that **maximizes** the likelihood function:

$$\hat{\theta} = \text{argmax}_{\theta}~L(\theta; D)$$
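
A minimal sketch of this idea for a Bernoulli model (the data below is made up): the grid-search maximizer of the log-likelihood matches the closed-form MLE, which for a Bernoulli is the sample mean.

```python
import numpy as np

# L(theta; D) = prod_k theta^{x_k} (1 - theta)^{1 - x_k}
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # 6 "heads" out of 8

def log_likelihood(theta, x):
    # Work in log space: sums are numerically safer than products.
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Grid search over candidate thetas (closed form exists; this is a demo).
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t, data) for t in grid])]

print(theta_hat)    # ≈ 0.75
print(data.mean())  # 0.75, the closed-form MLE
```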

Entropy is the uncertainty in your random variable $X$. It is defined as follows:

$$H(X) = -\sum_{x \in X} p(x)\log(p(x))$$

Units are **nats** for $\log_e$, **bits** for $\log_2$, and **dits** for $\log_{10}$.
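
A small helper makes the unit depend on the log base (a sketch, not a library function):

```python
import math

def entropy(p, base=math.e):
    """H(X) = -sum_x p(x) log p(x); the base picks the unit."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.5]                      # fair coin
print(entropy(p, 2))                # 1.0 bit
print(entropy(p))                   # ≈ 0.693 nats
print(abs(entropy([1.0], 2)))       # 0.0 — a certain outcome has no uncertainty
```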

**Conditional entropy** is the entropy of $X$ that remains after observing a random variable $Y$; conditioning can reduce the entropy or leave it unchanged.

$$H(X~|~Y) = -\sum_{x\in X}\sum_{y \in Y}p(x, y)\log(p(x~|~y))$$

The **mutual information** $I(X; Y) = H(X) - H(X~|~Y)$ measures this reduction in entropy.

$1$. Calculate the overall entropy $H(D)$ of the dataset.

$2$. Calculate the entropy after doing a split on a particular attribute $A$:

$$H_A(D) = \sum_{j = 1}^v \frac{|D_j|}{|D|} H(D_j)$$

where $j$ runs over the $v$ possible values of that attribute. For example, $Humidity = \{low, medium, high\}$ has three values. $D_j$ is the subset of data points with that particular value for the attribute, and $|D_j|$ is its size.

$3$. Choose the attribute with the highest information gain (equivalently, the one with the lowest entropy after the split, i.e. the lowest $H_A(D)$).

$$Gain(A) = H(D) - H_A(D)$$
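
Putting the split entropy and the gain together on a made-up toy table (attribute values and labels are purely illustrative):

```python
import math
from collections import Counter

def H(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy dataset: each row is (humidity value, class label).
data = [("low", "yes"), ("low", "yes"), ("medium", "yes"),
        ("medium", "no"), ("high", "no"), ("high", "no")]
labels = [y for _, y in data]

def split_entropy(data):
    """H_A(D): entropy after splitting on the attribute, weighted by |D_j|/|D|."""
    groups = {}
    for a, y in data:
        groups.setdefault(a, []).append(y)
    n = len(data)
    return sum(len(g) / n * H(g) for g in groups.values())

gain = H(labels) - split_entropy(data)  # Gain(A) = H(D) - H_A(D)
print(gain)                             # ≈ 0.667 bits
```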

$4$. Repeat until the termination criteria are satisfied (e.g. too few instances remain in a branch, or one class can reliably be chosen over the other, etc.).

- ID3 can **overfit** to the training data; a smaller tree could generalize better.
- ID3 is best used on classification data.
- The ID3 algorithm is greedy, and hence might not return the best solution.

Iterate over the data points $(x_i, y_i)$:

- Pick a sample point $x_i$.
- Compute $a = \beta^T x_i$.
- If $y_i \cdot a > 0$, do nothing.
- Else update $\beta^{new} \leftarrow \beta^{old} + y_ix_i$.

Repeat until the convergence criteria have been met.
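
The loop above can be sketched as follows, assuming $\pm 1$ labels and a bias folded in as a constant feature (the toy data is made up and linearly separable):

```python
import numpy as np

# Each row: [x1, x2, 1.0] — the trailing 1.0 is the bias feature.
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

beta = np.zeros(3)
for _ in range(100):                 # epoch cap as the convergence criterion
    mistakes = 0
    for x_i, y_i in zip(X, y):
        a = beta @ x_i
        if y_i * a <= 0:             # misclassified (or on the boundary)
            beta += y_i * x_i        # beta_new = beta_old + y_i x_i
            mistakes += 1
    if mistakes == 0:                # a full pass with no updates: done
        break

print(beta)
print(all(y_i * (beta @ x_i) > 0 for x_i, y_i in zip(X, y)))  # True
```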

Sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Derivative: $$ \sigma'(z)= \sigma(z)\cdot (1 - \sigma(z))$$
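
The derivative identity is easy to check against a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Check sigma'(z) = sigma(z) * (1 - sigma(z)) numerically at one point.
z, eps = 0.3, 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(analytic, numeric)  # the two values are nearly identical
```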

Likelihood function:

$$L(w) = \prod_i p(y = 1 ~ | ~ x_i)^{y_i}~p(y = 0 ~ | ~ x_i)^{1 - y_i}$$

Cross-entropy loss function: $$ -\log L(w) = -\sum_i \left[y_i\log(\sigma(w^Tx_i)) + (1 - y_i)\log(1 - \sigma(w^Tx_i))\right]$$

Gradient descent is a popular optimization algorithm used to find minima of the cost functions in our learning models.

Given a function $f: \Re^d \rightarrow \Re$, gradient descent uses the observation that the gradient

$$\nabla f$$

gives the direction of steepest ascent. If one steps in the opposite direction of $\nabla f$, one gets the steepest descent.
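
A minimal sketch of the update rule on a one-dimensional quadratic (the learning rate 0.1 and the starting point are arbitrary choices):

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x = 0.0
lr = 0.1                      # learning rate (step size)
for _ in range(200):
    grad = 2 * (x - 3)
    x -= lr * grad            # step opposite the gradient: steepest descent
print(x)                      # ≈ 3.0, the minimizer
```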

$1.$ Second order condition:

$$\mathcal{H}_{ij} = \left(\nabla^2f(x)\right)_{ij} = \frac{\partial^2f(x)}{\partial x_i \partial x_j}$$

for $i, j = 1, \ldots, n$ exists and

$$\mathcal{H} \succeq 0$$

i.e. the Hessian is **positive semi-definite**: for all vectors $v$,

$$v^T\mathcal{H}v \geq 0$$
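
A numerical check on a small example (the matrix below is made up): positive semi-definiteness can be tested via the eigenvalues or via random probes $v^T\mathcal{H}v$.

```python
import numpy as np

# Hessian of f(x) = x^T A x for symmetric A is 2A.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
H = 2 * A

# Test 1: all eigenvalues of the symmetric Hessian are non-negative.
eigs = np.linalg.eigvalsh(H)
print(eigs)        # [2. 6.] — all non-negative

# Test 2: v^T H v >= 0 for a batch of random probe vectors.
rng = np.random.default_rng(0)
ok = all(v @ H @ v >= 0 for v in rng.standard_normal((100, 2)))
print(ok)          # True
```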

$2.$ Use Jensen's inequality: a function $f$ is convex if and only if

$$f(E[x]) \leq E[f(x)]$$

holds for every random variable $x$.
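
A quick numeric sanity check with the convex $f(x) = x^2$ (random samples; for this $f$ the gap between the two sides is exactly the variance of $x$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

lhs = np.mean(x) ** 2          # f(E[x])
rhs = np.mean(x ** 2)          # E[f(x)]
print(lhs <= rhs)              # True — Jensen's inequality holds
```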