Midterm Review

By Sajad Darabi

February 9, 2018

"I don't love studying. I hate studying. I like learning. Learning is beautiful." – Natalie Portman


  • Machine Learning Concepts
  • Linear Regression
  • Linear Classifier
  • Matrix Calculus (on board)
  • Perceptron
  • ID3

Machine Learning Concepts

Types of Learning

Supervised learning: given a set of features $x_i$ and labels $y_i$ train a model to predict the label $y_{new}$ for an unseen feature $x_{new}$.

$$F_{model}: x_i \rightarrow y_i$$

Example: a dataset of breast tissue images, each labelled as either cancerous or non-cancerous.

Unsupervised learning: given a set of features $x_i$ find patterns in the data and assign it to a class $y_c$. $$F_{model}: x_i \rightarrow y_c$$

Example: cluster a set of patients according to their weight, height, etc.

Learning Problems

Regression: here the goal is to estimate a real-valued variable $y \in \mathbb{R}$ given an input vector $x_i$.

Example: predict a person's weight given height, gender, race, etc.

Classification: given an input feature $x_i$, categorize it and assign it a label $y \in Y := \{y_1, \dots, y_k\}$

Example: classify a document as written in one of $\{\text{English}, \text{French}, \text{Spanish}, \dots\}$

Learning Formulation

Goal: learn a function $f: x \rightarrow y$ to make predictions

$1$. Define a learning process.

  • supervised learning: define a loss function $l(h(x_i), y_i)$ which incurs some loss on bad predictions
  • unsupervised learning: learn a distribution of the data or an underlying structure.

$2$. Train the model using training data set

$3$. Make prediction using the trained model

  • hope it generalizes well...

Empirical Risk Minimization

A learning algorithm receives training set $S$, sampled from an unknown distribution $D$ and labeled by some target function $f$.

Our learning algorithm goal is to output a predictor $h_S: X \rightarrow Y$ (note that subscript here means that it depends on the training set used)

We can keep track of the error on the training set using the loss function. The empirical (training) risk is the average loss over the $m$ training samples:

$$L_S(h) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, h(x_i))$$
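As a small sketch of this quantity (function names and the choice of squared loss are illustrative, not from the source):

```python
import numpy as np

# Sketch: empirical risk L_S(h) as the average loss over the training set,
# using squared loss as an illustrative choice of L.
def empirical_risk(h, X, y):
    """Average loss of predictor h over training samples (X, y)."""
    preds = np.array([h(x) for x in X])
    return np.mean((preds - y) ** 2)

# Toy example with a hypothetical constant predictor
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
h = lambda x: 2.0  # always predicts 2
print(empirical_risk(h, X, y))  # mean of (1, 0, 1) = 2/3
```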

ERM Rule

This learning paradigm, of choosing a predictor $h$ that minimizes $L_S(h)$, is called empirical risk minimization.

What can go wrong?

  • overfitting

The learning algorithm can find a predictor that does very well on the training set but horribly on real-world data.

What is the true objective?

  1. minimize error on training set?
  2. minimize training error with regularization?
  3. minimize error on unseen examples?
  4. learn about them machines, before they take over?

The true goal is to generalize well to unseen data.

We use the ERM learning scheme so that we have something concrete and actionable to work with.

Linear Regression

$1$. This model assumes the function takes the form

$$h(x) = \sum_{i=1}^{n} \theta_i x_i = \theta^T x$$

$2$. Cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 = \frac{1}{2} (X\theta - y)^T(X\theta - y)$$

where $X$ is the design matrix whose rows are the $x_i$, and $y$ is the vector of labels.

Find the exact solution to the cost function

$$J(\theta) = \frac{1}{2} \sum_i^m (h_\theta(x_i) - y_i)^2$$
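Setting the gradient of $J(\theta)$ to zero gives the normal equations $X^TX\theta = X^Ty$, so the exact minimizer is $\theta = (X^TX)^{-1}X^Ty$ when $X^TX$ is invertible. A minimal numerical sketch (toy data and names are illustrative):

```python
import numpy as np

# Sketch: exact least-squares solution via the normal equations,
# theta = (X^T X)^{-1} X^T y. lstsq is used instead of an explicit
# inverse for numerical stability.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])     # first column is the bias feature x_0 = 1
y = np.array([2.0, 3.0, 4.0])  # exactly y = 1 + x

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # close to [1.0, 1.0]
```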

Perceptron Online Learning Algorithm

Iterate over data points $(x_i, y_i)$

  1. Pick a sample point $x_i$

  2. Compute $a = \beta^T x_i$

  3. If $y_i a > 0$, do nothing (the point is correctly classified)

  4. Else update $\beta^{new} \leftarrow \beta^{old} + y_i x_i$

Repeat until the convergence criterion has been met.
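The steps above can be sketched as follows (labels are assumed to be in $\{-1, +1\}$; the function name and the stopping rule of one clean pass are illustrative choices):

```python
import numpy as np

# Sketch of the perceptron online update rule; labels y_i are in {-1, +1}.
def perceptron(X, y, max_epochs=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (beta @ x_i) <= 0:  # misclassified (or on the boundary)
                beta = beta + y_i * x_i  # update step 4
                mistakes += 1
        if mistakes == 0:                # convergence: one full clean pass
            break
    return beta

# Linearly separable toy data
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta = perceptron(X, y)
print(np.sign(X @ beta))  # matches y on separable data
```

On linearly separable data the perceptron is guaranteed to converge after finitely many mistakes; on non-separable data the `max_epochs` cap is what terminates the loop.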

Decision Tree

Entropy measures the uncertainty in a random variable $X$. It is defined as follows:

$$H(X) = -\sum_{x \in X} p(x)\log p(x)$$

Units are nats for $\log_e$, bits for $\log_2$, and dits for $\log_{10}$.
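A minimal sketch of this definition in bits, using the standard convention $0 \log 0 = 0$ (the function name is illustrative):

```python
import numpy as np

# Sketch: entropy of a discrete distribution, in bits (log base 2).
def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # convention: 0 * log 0 = 0
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1.0]))       # certain outcome: no uncertainty
```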

Mutual Information

Mutual information measures the reduction in the entropy of $X$ from observing another random variable $Y$: $I(X;Y) = H(X) - H(X|Y)$. Observing $Y$ can reduce the entropy or leave it unchanged, so $I(X;Y) \geq 0$. The conditional entropy is

$$H(X|Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y)\log p(x \mid y)$$
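These quantities can be sketched from a joint distribution table (function names and the toy joint are illustrative):

```python
import numpy as np

# Sketch: conditional entropy H(X|Y) and mutual information I(X;Y)
# computed from a joint distribution p(x, y), in bits.
def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal p(x)
    py = joint.sum(axis=0)  # marginal p(y)
    # H(X|Y) = sum_y p(y) * H(X | Y = y)
    h_x_given_y = sum(
        p_y * entropy(joint[:, j] / p_y)
        for j, p_y in enumerate(py) if p_y > 0
    )
    return entropy(px) - h_x_given_y

# X is fully determined by Y: observing Y removes all uncertainty
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0 bit
```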

ID3 Algorithm

  1. Calculate the overall entropy $H(Y)$.
  2. Calculate the entropy after doing a split on a particular attribute $A_1$.

$$H_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} H(D_j)$$

where $j$ ranges over the $v$ possible values of the attribute. For example, $Humidity = \{low, medium, high\}$ has three values. $D_j$ is the subset of data points with the $j$-th value for that attribute, and $|D_j|$ is its size.

$3$. Choose the attribute with the highest information gain (equivalently, the one with the lowest entropy after the split, i.e. the lowest $H_A(D)$).

$$Gain(A) = H(D) - H_A(D)$$

$4$. Repeat until a termination criterion is satisfied (e.g. too few instances remain in a branch, or one class can be reliably chosen over the others).
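Steps 1–3 can be sketched as a single information-gain computation (function names and the toy split are illustrative):

```python
import numpy as np

# Sketch: information gain of a candidate split, following the
# H_A(D) formula above. Labels and attribute values are arbitrary ints.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attr_values, labels):
    attr_values = np.asarray(attr_values)
    labels = np.asarray(labels)
    # Weighted entropy H_A(D) after splitting on this attribute
    h_a = sum(
        (labels[attr_values == v].size / labels.size)
        * entropy(labels[attr_values == v])
        for v in np.unique(attr_values)
    )
    return entropy(labels) - h_a

# Toy split: the attribute perfectly predicts the label
attr = [0, 0, 1, 1]
y    = [0, 0, 1, 1]
print(information_gain(attr, y))  # 1.0 bit
```

ID3 would evaluate this gain for every remaining attribute at each node and split on the argmax.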

To keep in mind

  • ID3 can overfit the training data; a smaller tree may generalize better.
  • ID3 is best suited to classification tasks with categorical attributes.
  • The ID3 algorithm is greedy, and hence might not return the optimal tree.

That's it for now :)