# Content¶

• Machine Learning Concepts
• Maximum Likelihood
• Decision Tree Algorithm
• Scikit Learn Decision Tree Classifier

# Machine Learning Concepts¶

## Definitions and Terminology¶

• Example: a particular instance of data
• Features: the set of attributes describing an example, represented as a vector $x_i$
• Labels:
  • in classification, the category associated with an example
  • in regression, a real-valued number associated with an example
• Training data: data used for training an algorithm
• Test data: data used for evaluating the trained model

## Steps to ML¶

Steps to creating an ML model that can be used for making inferences. Creating a reasonable ML model is an iterative process, as indicated by the feedback loop from step 5 back to step 3.
1. Collect the data of interest, e.g. images of cats and dogs, vital signs, etc.
2. The collected data does not always come in a form that is convenient to feed to an ML model. It must first be pre-processed and cleaned before being used for training and testing.
3. Once the data is ready, an ML model is chosen according to the goal we are trying to achieve (classification, regression, etc.). It is then trained using the training data.
4. The trained model is then tested using test data that the model has not been exposed to before.
5. Improve the model if it does not meet the requirements.
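The steps above can be sketched with scikit-learn; this is a minimal illustration (the dataset and model are placeholders chosen for the example, not prescribed by the steps):

```python
from sklearn.datasets import load_iris                # step 1: collect the data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                                    # step 2: already cleaned and numeric here
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()                      # step 3: choose and train a model
model.fit(X_train, y_train)

score = model.score(X_test, y_test)                   # step 4: evaluate on unseen test data
# step 5: if the score does not meet requirements, revisit the model or its hyperparameters
```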

## Types of Learning¶

Supervised learning: given a set of features $x_i$ and labels $y_i$ train a model to predict the label $y_{new}$ for an unseen feature $x_{new}$.

$$F_{model}: x_i \rightarrow y_i$$

Example: a dataset containing images of breast tissue labelled as either cancerous or non-cancerous.

Unsupervised learning: given a set of features $x_i$, find patterns in the data and assign each example to a class $y_c$. $$F_{model}: x_i \rightarrow y_c$$

Example: cluster a set of patients according to their weight, height, etc..

## Learning Problems¶

Regression: here the goal is to estimate a real-valued variable $y \in \mathbb{R}$ given an input vector $x_i$

Example: predict weight of person given height, gender, race, etc.

Classification: given an input feature $x_i$, categorize it and assign it a label $y \in Y:= \{y_1, ... y_k\}$

Example: classify a document as written in $\{english, french, spanish, ...\}$

## Types of Classifiers¶

1. Instance-based:
• memorize the training data and use it directly as a reference for classifying new examples
2. Generative:
• build a statistical model of the underlying system that generated the examples (i.e. learn a statistical model)
3. Discriminative:
• estimate a decision rule or boundary that splits the examples into different regions corresponding to the different classes.

## Learning Formulation¶

Goal: to learn a function $f: x \rightarrow y$ to make predictions

1. Define a learning process.
• supervised learning: define a loss function $l(x_i, y_i)$ which incurs some loss on bad predictions
• unsupervised learning: learn a distribution of the data or an underlying structure.
2. Train the model using the training dataset
3. Make predictions using the trained model
• and hope it generalizes well...

## Features are Important¶

Choosing the right features for a particular learning task can be quite a hassle!

• bad features are uncorrelated with the labels, making it hard for the algorithm to learn
• good features correlate with the labels and make the model effective.

## Generalization¶

Generalization is synonymous with not memorizing the training dataset.

• We are interested in the performance of the model on unseen data
• Minimizing error on the training set does not guarantee good generalization

You can overfit your model to the training set if your hypothesis is too complex!

Or underfit if your model is too simple.

Occam's Razor: given a simpler model that fits the data adequately relative to a more complex model, the simpler model should be chosen, as it makes fewer assumptions about the underlying system.

# Maximum Likelihood Estimation¶

Recall that an estimator is a rule for choosing a parameter

$$\hat{\theta} \in \Theta$$

which is our best guess of the true underlying parameters of our model.

For example, the sample mean

$$M_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

We evaluated this estimator using some definition of estimation error.

• squared error $$L(\hat{\theta}, \theta_{t}) = ||\hat{\theta} - \theta_t||^2$$

• absolute error $$L(\hat{\theta}, \theta_{t}) = |\hat{\theta} - \theta_t|$$

We can think of the parameter $\hat{\theta}$ as a random variable.

Its expectation is then called the statistical risk

$$R(\hat{\theta}) = E[L(\hat{\theta}, \theta_t)]$$

To use these estimation methods we needed

• a sample dataset $D$ that can be used to make statements about the underlying probability distribution
• $D$ is a realization of the complete data $D_{c}$

We can think of the sample as a realization of some random vector which has an unknown joint distribution function with parameters $\theta$.

We are often interested in the true parameter $\theta$ that generated the sample dataset in the first place, but at best we can come up with a way to estimate it in a reasonable way.

## MLE¶

MLE is a method of estimating model parameters by maximizing the likelihood function.

The likelihood of observing our dataset given the model parameters is given by

$$L(\theta; D) = P(D|\theta) = P((x_1, y_1), (x_2, y_2), ... (x_n, y_n)| \theta) = \prod_{k = 1}^np((x_k, y_k)|\theta)$$

Goal: determine the parameters $\theta$ of our model.

We can achieve this by finding the $\theta$ that maximizes the likelihood function:

$$\hat{\theta} = \arg\max_{\theta}~L(\theta; D)$$

## Assumptions¶

When using MLE to estimate our model parameters we make the following assumptions (not all are listed here):

• the examples $\{x_i\}$ are IID
• the log-likelihood function is twice differentiable with respect to $\theta$ in a neighbourhood of the "true" $\theta_t$

## Log Probabilities¶

The likelihood function is a product of probabilities, and it is important to note that probabilities satisfy $0 \leq p(x) \leq 1$

• their product will result in extremely small values!

To avoid underflow we can instead work with $\log(L(\theta))$, as follows:

$$\log L(\theta) = \log\left(\prod_{k=1}^{n} p(x_k|\theta)\right) = \sum_{k=1}^{n} \log p(x_k|\theta)$$

• Maximizing this function is the same as maximizing $L(\theta)$, as $\log$ is a strictly increasing function.
• log space also turns $\prod \rightarrow \sum$, which is more convenient for derivatives!
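The underflow problem is easy to demonstrate numerically; a minimal sketch (the probability values are made up for illustration):

```python
import math

probs = [0.01] * 200   # 200 i.i.d. observations, each with likelihood 0.01

# the naive product of probabilities underflows to exactly 0.0
likelihood = 1.0
for p in probs:
    likelihood *= p

# the sum of log-probabilities stays a perfectly ordinary float
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0  (the true value, 1e-400, is below the smallest float)
print(log_likelihood)   # -921.03...
```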

## Example 1¶

### For $x_1, ..., x_n$ iid Poisson random variables find an estimate for $\lambda$ using the MLE method.¶

$$P(X = x) = \frac{\lambda^xe^{-\lambda}}{x!}$$
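A sketch of the solution, filling in the steps: under the IID assumption the log-likelihood is

$$\log L(\lambda) = \sum_{k=1}^{n}\log\left(\frac{\lambda^{x_k}e^{-\lambda}}{x_k!}\right) = \log(\lambda)\sum_{k=1}^{n}x_k - n\lambda - \sum_{k=1}^{n}\log(x_k!)$$

Setting the derivative with respect to $\lambda$ to zero,

$$\frac{d}{d\lambda}\log L(\lambda) = \frac{1}{\lambda}\sum_{k=1}^{n}x_k - n = 0 \implies \hat{\lambda} = \frac{1}{n}\sum_{k=1}^{n}x_k$$

so the MLE of $\lambda$ is simply the sample mean.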

## Example 2¶

### For $x_1, ..., x_n$ iid Normal random variables find an estimate for $\mu$ and $\sigma$ using the MLE method¶

$$N(\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}e^{(-\frac{1}{2}[\frac{x - \mu}{\sigma}]^2)}$$
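A sketch of the solution: the log-likelihood is

$$\log L(\mu, \sigma) = -n\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k - \mu)^2$$

Setting the partial derivatives with respect to $\mu$ and $\sigma$ to zero gives

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$

i.e. the sample mean and the (biased) sample variance.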

# Entropy¶

## Information¶

Entropy is a foundational concept in information theory.

Information can be thought of as being stored in a variable that can take on different values.

We get information by looking at the value of that variable, just as we get information by going to the next slide and reading its content.

The entropy defined in information theory is related to the entropy in mechanical systems:

$$H(X) = -\sum_{x \in X} p(x)log(p(x))$$

• it is the uncertainty of a random variable. In this case the random variable $X$.

It measures the randomness contained in that random variable.

• The higher the entropy the harder to draw conclusions from the outcome of a random variable.
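As a small illustration, consider coin flips (a minimal sketch using only the standard library; the `entropy` helper is defined here, not imported from a library):

```python
import math

def entropy(probs, base=2):
    """Entropy of a discrete distribution; base 2 gives bits."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit, maximal uncertainty
print(entropy([0.9, 0.1]))    # biased coin: ~0.47 bits, outcome easier to predict
print(entropy([1.0]))         # deterministic outcome: 0 bits, no randomness
```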

## Conditional Entropy¶

We can also define the conditional entropy

$$H(X|Y) = -\sum_{y\in Y}p(y)\sum_{x\in X}p(x|y)\log(p(x|y))$$

which is the randomness remaining in a random variable given knowledge of the state of another random variable.

If the base of the log is $e$ or $2$, then the units of $H(X)$ are nats or bits, respectively.

## Definition of Entropy¶

Entropy, as it is defined, can be derived in different ways; a common one is the axiomatic approach, in which we require it to satisfy certain desirable properties.

You can also think of entropy as the expected value of $\log\frac{1}{p(X)}$:

$$E\left[\log\frac{1}{p(X)}\right] = \sum_x p(x)\log\frac{1}{p(x)}$$

## Mutual Information¶

Mutual information quantifies the amount of uncertainty about one random variable that is removed upon observing another.

It is calculated as

\begin{align} I(X; Y) &= \sum_{x, y}p(x, y)\log\frac{p(x, y)}{p(x)p(y)}\\ &= H(X) - H(X|Y)\\ &= H(Y) - H(Y|X) \end{align}

This is also called the information gain upon observing $Y$.
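The equivalence between the two forms can be checked numerically; a sketch with a made-up joint distribution (the `H` helper and the table are ours, for illustration only):

```python
import math

def H(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a hypothetical joint distribution p(x, y) over two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# marginals p(x) and p(y)
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# I(X; Y) from the definition
mi = sum(p * math.log2(p / (px[x] * py[y]))
         for (x, y), p in joint.items() if p > 0)

# the same quantity as H(X) - H(X|Y), using H(X|Y) = -sum p(x,y) log p(x|y)
h_x = H(px.values())
h_x_given_y = -sum(p * math.log2(p / py[y])
                   for (x, y), p in joint.items() if p > 0)

print(round(mi, 4))                 # 0.2781
print(round(h_x - h_x_given_y, 4))  # 0.2781, the same value
```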

# Decision Trees¶

A decision tree is a discriminative classifier

• it estimates a decision rule/boundary amongst examples

• an intuitive classifier

• easy to understand, construct and visualize

Given its simplicity, it actually works very well in practice!

## The Algorithm¶

We split on particular attributes to partition the feature space into regions such that each leaf represents a certain class.

The splitting criteria can be different depending on the type of algorithm used.

## ID3 Algorithm¶

Use entropy

• a measure of uncertainty (impurity) associated with the attribute

The uncertainty at a node when classifying an instance is given by

$$H(D) = -\sum_{i=1}^{n}p_i\log(p_i)$$

where $p_i$ is the proportion of examples in $D$ belonging to class $i$.

We can potentially reduce this uncertainty by splitting that node using an attribute.

$$H_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} H(D_j)$$

which is the weighted average of the entropies after the split, where $v$ is the number of values attribute $A$ can take.

The criterion used in ID3 is the information gained by splitting on attribute $A$:

$$Gain(A) = H(D) - H_A(D)$$

## Build the tree for the following using ID3¶

| Day | Outlook | Temperature | Humidity | Wind | Play ball |
|-----|---------|-------------|----------|------|-----------|
| D1  | Sunny    | Hot  | High   | Weak   | No  |
| D2  | Sunny    | Hot  | High   | Strong | No  |
| D3  | Overcast | Hot  | High   | Weak   | Yes |
| D4  | Rain     | Mild | High   | Weak   | Yes |
| D5  | Rain     | Cool | Normal | Weak   | Yes |
| D6  | Rain     | Cool | Normal | Strong | No  |
| D7  | Overcast | Cool | Normal | Strong | Yes |
| D8  | Sunny    | Mild | High   | Weak   | No  |
| D9  | Sunny    | Cool | Normal | Weak   | Yes |
| D10 | Rain     | Mild | Normal | Weak   | Yes |
| D11 | Sunny    | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High   | Strong | Yes |
| D13 | Overcast | Hot  | Normal | Weak   | Yes |
| D14 | Rain     | Mild | High   | Strong | No  |
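As a check on the exercise, the information gain of each attribute at the root can be computed directly from the table (a sketch; `entropy` and `gain` are helpers written here, not library calls):

```python
import math

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# the dataset above: (Outlook, Temperature, Humidity, Wind, Play ball)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]
labels = [row[-1] for row in data]

def gain(attr):
    """Gain(A) = H(D) - H_A(D), the weighted entropy after splitting on A."""
    i = attributes.index(attr)
    values = set(row[i] for row in data)
    h_a = sum((len(subset) / len(data)) * entropy(subset)
              for subset in ([row[-1] for row in data if row[i] == v]
                             for v in values))
    return entropy(labels) - h_a

for a in attributes:
    print(a, round(gain(a), 3))
# Outlook has the largest gain (~0.247), so ID3 splits the root on Outlook
```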

# Scikit Learn Decision Tree Classifier¶

We will build a decision tree classifier using the scikit-learn library and the iris dataset.

You can check out the scikit-learn decision tree tutorial for more details.

## Visualizing¶

A decision tree classifier splits the data according to various impurity criteria. Here we use the entropy criterion, so splits at a particular node are chosen by information gain, as in the ID3 algorithm.

We can visualize how different attributes split the data using a histogram plot.

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
iris.feature_names # 4 features

Out[2]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
In [3]:
list(iris.target_names) # 3 classes

Out[3]:
['setosa', 'versicolor', 'virginica']

## Baseline Classifier¶

Prior to training a decision tree classifier we need a baseline classifier to compare to. The DummyClassifier is useful for this purpose.

Let's fit a DummyClassifier with the iris dataset.

In [4]:
from sklearn.dummy import DummyClassifier

random_classifier = DummyClassifier(strategy='stratified')  # predict labels at random according to the class priors
clf = random_classifier.fit(iris.data, iris.target)

In [5]:
clf.class_prior_

Out[5]:
array([ 0.33333333,  0.33333333,  0.33333333])

Let's test our random classifier on the training set.

In [6]:
n_correct = sum((clf.predict(iris.data) - iris.target) == 0)
accuracy = 100 *  n_correct / len(iris.target)
accuracy

Out[6]:
34.0

## Building the Decision Tree Classifier¶

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size=0.2,
                                                    random_state=0)

In [8]:
X_train.shape, y_train.shape

Out[8]:
((120, 4), (120,))
In [9]:
clf = DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Out[9]:
1.0