- Machine Learning Concepts
- Linear Regression
- Linear Classifier
- Matrix Calculus (on board)
- Perceptron
- ID3

**Supervised learning**: given a set of features $x_i$ and labels $y_i$, train a model to predict the label $y_{new}$ for an unseen feature $x_{new}$.

$$F_{model}: x_i \rightarrow y_i$$

Example: a dataset of breast-tissue images, each labelled as either cancerous or non-cancerous.

**Unsupervised learning**: given a set of features $x_i$, find patterns in the data and assign each point to a class $y_c$.
$$F_{model}: x_i \rightarrow y_c$$

Example: cluster a set of patients according to their weight, height, etc.

**Regression**: the goal is to estimate a real-valued variable $y \in \mathbb{R}$ given an input vector $x_i$.

Example: predict weight of person given height, gender, race, etc.

**Classification**: given an input feature $x_i$, categorize it and assign it a label $y \in Y := \{y_1, ..., y_k\}$.

Example: classify a document as written in $\{english, french, spanish, ...\}$

**Goal**: to learn a function $f: x \rightarrow y$ to make predictions.

$1$. Define a learning process.

**supervised learning**: define a loss function $l(x_i, y_i)$ which will incur some loss on bad predictions.

**unsupervised learning**: learn a distribution of the data or an underlying structure.

$2$. Train the model using the training data set.

$3$. Make predictions using the trained model.

- hope it generalizes well...

A learning algorithm receives training set $S$, sampled from an unknown distribution $D$ and labeled by some target function $f$.

We can keep track of the error on the training set using the loss function. The empirical risk is the average loss over the $m$ training samples:

$$L_S(h) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, h(x_i))$$

This learning paradigm, of coming up with a predictor $h$ that minimizes $L_S(h)$, is called empirical risk minimization (ERM).
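A minimal sketch of empirical risk as an average loss over the training sample; the dataset, predictor, and loss below are hypothetical, chosen only to illustrate the definition.

```python
# Empirical risk L_S(h): average loss of predictor h over the training sample S.

def empirical_risk(h, samples, loss):
    """Mean of loss(y, h(x)) over all (x, y) pairs in the sample."""
    return sum(loss(y, h(x)) for x, y in samples) / len(samples)

# Hypothetical training set, a simple threshold predictor, and the 0-1 loss.
S = [(0.2, -1), (0.8, 1), (0.6, 1), (0.1, -1)]
h = lambda x: 1 if x > 0.5 else -1
zero_one = lambda y, y_hat: 0 if y == y_hat else 1

print(empirical_risk(h, S, zero_one))  # 0.0 — h classifies every sample correctly
```

ERM then amounts to searching a hypothesis class for the `h` that makes this number as small as possible.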

What can go wrong?

**overfitting**

The learning algorithm can find a predictor that does very well on the training set but horribly on real-world data.

- minimize error on training set?
- minimize training error with regularization?
- minimize error on unseen examples?
- learn about them machines, before they take over?

**true goal** is to be able to generalize.

We use the ERM learning scheme so that we have something concrete and actionable to work with.

$1$. The model assumes the hypothesis takes the form

$$h(x) = \sum_{i=1}^n \theta_ix_i = \theta^Tx$$

$2$. Cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2 = \frac{1}{2} (X\theta - y)^T(X\theta - y)$$

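The cost $J(\theta)$ can be minimized by batch gradient descent, using the gradient $\nabla J = X^T(X\theta - y)$. A sketch on hypothetical data drawn from the line $y = 2x + 1$ (the learning rate, step count, and data are illustrative choices, not part of the notes):

```python
import numpy as np

def fit_linear(X, y, lr=0.1, steps=1000):
    """Minimize J(theta) = 1/2 * sum (theta^T x_i - y_i)^2 by gradient descent.
    The gradient of J is X^T (X theta - y); we scale by 1/m for stability."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * X.T @ (X @ theta - y) / len(y)
    return theta

# Hypothetical data from y = 2x + 1; the first column is a constant bias feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = fit_linear(X, y)
print(np.round(theta, 3))  # ≈ [1. 2.] — recovers intercept 1 and slope 2
```

Because $J$ is convex in $\theta$, gradient descent with a small enough step size converges to the same solution as the closed-form normal equations.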

**Perceptron**: iterate over the data points $(x_i, y_i)$:

Pick a sample point $x_i$

Compute $a = \beta^T x_i$

if $y_i \cdot a > 0$ (correctly classified), do nothing

else update $\beta^{new} \leftarrow \beta^{old} + y_ix_i$

Repeat until the convergence criterion has been met.
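The loop above can be sketched as follows; the toy dataset is made up, and the last feature component acts as a bias term (an assumption, since the notes don't specify how the bias is handled):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron update from the notes: if y_i * (beta^T x_i) <= 0,
    set beta <- beta + y_i * x_i; stop when a full epoch makes no mistakes."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (beta @ x_i) <= 0:   # misclassified (or on the boundary)
                beta += y_i * x_i
                mistakes += 1
        if mistakes == 0:                 # convergence criterion met
            break
    return beta

# Hypothetical linearly separable data; labels are +1 / -1.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
beta = perceptron(X, y)
assert all(y_i * (beta @ x_i) > 0 for x_i, y_i in zip(X, y))  # separates the data
```

If the data is linearly separable, this loop is guaranteed to terminate; otherwise it cycles, which is why the `max_epochs` cap is needed.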

Entropy is the uncertainty in a random variable $X$. It is defined as follows:

$$H(X) = -\sum_{x \in X} p(x)\log p(x)$$

Units are **nats** for $\log_e$, **bits** for $\log_2$, and **dits** for $\log_{10}$.
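A small sketch of the definition (the distributions below are illustrative):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p(x) log p(x); base 2 gives bits, e gives nats, 10 gives dits.
    Terms with p(x) = 0 contribute nothing, by the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 bit  — a fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits — a fair four-sided die
```

A uniform distribution over $n$ outcomes maximizes the entropy at $\log n$; any skew toward one outcome lowers it.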

Conditional entropy is the uncertainty remaining in $X$ after observing a random variable $Y$; observing $Y$ can reduce the entropy or leave it unchanged.

$$H(X~|~Y) = -\sum_{x\in X}\sum_{y\in Y}p(x, y)\log(p(x~|~y))$$

Mutual information is the resulting reduction in entropy: $I(X; Y) = H(X) - H(X~|~Y)$.

$1$. Calculate the overall entropy $H(Y)$.

$2$. Calculate the entropy after doing a split on a particular attribute $A$.

$$H_A(D) = \sum_{j=1}^v \frac{|D_j|}{|D|} H(D_j)$$

$3$. Choose the attribute with the highest information gain (equivalently, the one with the lowest entropy after the split, i.e. the lowest $H_A(D)$).

$$Gain(A) = H(D) - H_A(D)$$
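The gain computation can be sketched as below; the toy dataset and the attribute name `"wind"` are hypothetical, chosen so the split is easy to check by hand.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) over the empirical label distribution of a dataset."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(A) = H(D) - sum_j |D_j|/|D| * H(D_j), splitting on attribute attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: ID3 would pick the attribute with the largest gain.
rows = [{"wind": "weak"}, {"wind": "weak"}, {"wind": "strong"}, {"wind": "strong"}]
labels = ["yes", "yes", "no", "no"]
print(gain(rows, labels, "wind"))  # 1.0 — this split removes all uncertainty
```

ID3 evaluates this quantity for every remaining attribute at each node and recurses on the partitions induced by the winner.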

- ID3 can **overfit** to training data; a smaller tree could generalize better.
- ID3 is best used on classification data.
- The ID3 algorithm is greedy, and hence might not return the best solution.