- Machine Learning Concepts
- Maximum Likelihood
- Decision Tree Algorithm
- Scikit Learn Decision Tree Classifier

**Example**: a particular instance of data.

**Features**: set of attributes, represented as a vector $x_i$.

**Labels**:
- in classification, a category associated with an example
- in regression, a real-valued number associated with an example

**Training data**: data used for training an algorithm.

**Test data**: data used for evaluating the trained model.

- In step 1 we collect the data of interest, e.g. cats & dogs, vital signals, etc.

- The trained model is then tested using test data to which it has not been exposed before.
- Improve the model if it does not meet requirements.

**Supervised learning**: given a set of features $x_i$ and labels $y_i$ train a model to predict the label $y_{new}$ for an unseen feature $x_{new}$.

$$F_{model}: x_i \rightarrow y_i$$

Example: a dataset containing breast images labelled as either cancerous or non-cancerous.

**Unsupervised learning**: given a set of features $x_i$ find patterns in the data and assign each example to a class $y_c$.
$$F_{model}: x_i \rightarrow y_c$$

Example: cluster a set of patients according to their weight, height, etc.
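As a rough sketch of that clustering example, here is scikit-learn's KMeans on made-up height/weight vectors (the data and the choice of two clusters are illustrative assumptions, not part of the notes):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical patient features: [height (cm), weight (kg)]
X = np.array([
    [160, 55], [162, 58], [158, 52],   # one group of patients
    [185, 95], [190, 100], [188, 92],  # another group of patients
])

# Ask for two clusters; each patient gets assigned a class y_c
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)
```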

**Regression**: Here the goal is to estimate a real-valued variable $y \in \mathbb{R}$ given an input vector $x_i$.

Example: predict weight of person given height, gender, race, etc.

**Classification**: given an input feature $x_i$ categorize it and assign it a label $y \in Y := \{y_1, ... y_k\}$

Example: classify a document as written in $\{english, french, spanish, ...\}$

- Instance based classification
- memorize the training data and use it as a reference for classifying new examples

- Generative:
- build a statistical model that models the underlying system that generated the examples (i.e. learn a statistical model)

- Discriminative:
- estimate a decision rule or boundary that splits the examples into different regions corresponding to the different classes.

**Goal**: to learn a function $f: x \rightarrow y$ to make predictions

- Define a learning process.
- supervised learning: define a loss function $l(x_i, y_i)$ which will incur some loss on bad predictions
- unsupervised learning: learn a distribution of the data or an underlying structure.

- Train the model using training data set
- Make prediction using the trained model
- hope it generalizes well...
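The steps above can be sketched with scikit-learn (the bundled iris dataset stands in for real data; the split ratio and model choice are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labelled data: features x_i and labels y_i
X, y = load_iris(return_X_y=True)

# Hold out test data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train the model on the training set only
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on unseen examples to estimate how well it generalizes
print("test accuracy:", model.score(X_test, y_test))
```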

Choosing the right features to be used for a particular learning task could be quite a hassle!

- bad features are uncorrelated with the labels, making it hard for the algorithm to learn

- good features correlate with the labels and make the model effective.

Generalization is synonymous with not memorizing the training dataset.

- We are interested in the performance of the model on unseen data
- Minimizing error on the training set does not guarantee good generalization

You can overfit your model to the training set if your hypothesis is too complex!

Or underfit if your model is too simple.

**Occam's Razor**: Given a simpler model that fits the data adequately relative to a more complex model, the simpler model should be chosen as it makes fewer assumptions about the underlying system.

Entropy is a foundational concept in information theory.

Information can be thought to be stored in a variable that can take on different values.

We get information by looking at the value of that variable, just the way we get information by going to the next slide and reading its content.

The entropy defined in information theory is related to the entropy in mechanical systems:

$$H(X) = -\sum_{x \in X} p(x)log(p(x))$$

- it is the uncertainty of a random variable. In this case the random variable $X$.

It measures the randomness contained in that random variable.

- The higher the entropy the harder to draw conclusions from the outcome of a random variable.
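A minimal entropy function makes this concrete (a plain-Python sketch; base 2, so the result is in bits):

```python
import math

def entropy(probs, base=2):
    # H(X) = -sum p(x) log p(x); terms with p(x) = 0 contribute nothing
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -- a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))  # ~0.469 -- a biased coin is easier to predict
```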

We can also define conditional entropy:

$$H(X|Y) = -\sum_{x\in X,\, y\in Y}p(x, y)log(p(x~|~y))$$

**If the base of the log used is $e$ or $2$, then the units for $H(X)$ are nats and bits respectively.**

You can also think of entropy as the expected value of $log\frac{1}{p(x)}$

$$E\left[log\frac{1}{p(x)}\right] = \sum p(x)log\left(\frac{1}{p(x)}\right)$$

Mutual information quantifies the amount of uncertainty about one variable that is removed upon observing another.

It is calculated using

\begin{align} I(X; Y) &= \sum_{x, y}p(x, y)log\frac{p(x, y)}{p(x)p(y)}\\ &= H(X) - H(X|Y)\\ &= H(Y) - H(Y|X) \end{align}

This is also called the information gain upon observing $Y$.
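A small numeric check that these forms agree, using a made-up joint distribution over two binary variables (the distribution is an illustrative assumption):

```python
import math

def H(probs):
    # Entropy in bits
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

# Direct definition: I(X;Y) = sum p(x,y) log p(x,y)/(p(x)p(y))
I = sum(p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in p_xy.items() if p > 0)

# Alternative form: H(X) - H(X|Y), with H(X|Y) = -sum p(x,y) log p(x|y)
H_x = H(p_x.values())
H_x_given_y = -sum(p * math.log2(p / p_y[y])
                   for (x, y), p in p_xy.items() if p > 0)

print(I, H_x - H_x_given_y)  # the two computations agree
```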

A decision tree is a discriminative classifier:

it estimates a decision rule/boundary among examples.

An intuitive classifier:

- easy to understand, construct and visualize

Given its simplicity it actually works very well in practice!
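To see how easy a tree is to inspect, scikit-learn can print a fitted tree's rules as text (the iris dataset and the depth limit here are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target)

# Print the learned splits as indented if/else-style rules
print(export_text(tree, feature_names=list(iris.feature_names)))
```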

- We use splits on a particular attribute to partition the feature space so that each leaf represents a certain class.

- The splitting criteria can be different depending on the type of algorithm used.

In the ID3 algorithm we decide which attribute to split on based on entropy

- a measure of uncertainty (impurity) associated with the attribute

The uncertainty at a node to classify an instance is given by

$$H(D) = -\sum_{i=1}^{n}p_i\,log(p_i)$$

We can potentially reduce this uncertainty by splitting that node using an attribute.

$$H_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} H(D_j)$$

Which is the weighted average of the entropies, after the split.

The criterion we use in ID3 is the information gained after splitting on attribute $A$:

$$Gain(A) = H(D) - H_A(D)$$

| Day | Outlook | Temperature | Humidity | Wind | Play ball |
|-----|---------|-------------|----------|------|-----------|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
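Applying the ID3 gain formula to the table above (a plain-Python sketch; the helper names are our own):

```python
import math
from collections import Counter

# The play-ball dataset from the table: (Outlook, Temperature, Humidity, Wind, Play ball)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    # H(D) = -sum p_i log2(p_i) over class proportions
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr_idx):
    # Gain(A) = H(D) - H_A(D)
    total = entropy([r[-1] for r in rows])
    values = set(r[attr_idx] for r in rows)
    # H_A(D): weighted average of the entropies after the split
    remainder = sum(
        len(subset) / len(rows) * entropy([r[-1] for r in subset])
        for subset in ([r for r in rows if r[attr_idx] == v] for v in values)
    )
    return total - remainder

for i, a in enumerate(attributes):
    print(f"Gain({a}) = {gain(data, i):.3f}")
# Outlook has the highest gain (~0.247), so ID3 splits the root on Outlook
```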