# Nearest Neighbor Classifier

Say we want to classify an image.

We can compare images using a notion of similarity, i.e. a distance between them.

• L1 distance $$d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$$

• L2 distance $$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$$

Difference between L1 and L2?

• L2 distance penalizes large mistakes much more than L1
• L2 prefers many medium-sized disagreements over one big mistake (i.e. pixels that are uniformly close to the other image)
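To make the two distances concrete, here is a minimal NumPy sketch (the toy images and values are made up for illustration):

```python
import numpy as np

def l1_distance(I1, I2):
    # Sum of absolute pixel-wise differences
    return np.sum(np.abs(I1 - I2))

def l2_distance(I1, I2):
    # Square root of the sum of squared pixel-wise differences
    return np.sqrt(np.sum((I1 - I2) ** 2))

# Toy 2x2 "images"
I1 = np.array([[0.0, 1.0], [2.0, 3.0]])
I2 = np.array([[0.0, 2.0], [2.0, 1.0]])
print(l1_distance(I1, I2))  # 3.0
print(l2_distance(I1, I2))  # sqrt(5) ≈ 2.236
```

Note how the single pixel difference of 2 contributes 2 to the L1 sum but 4 to the squared sum: this is the sense in which L2 penalizes large disagreements more heavily.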

# The Algorithm

Training: to train the NN classifier, simply store all the training examples.

• fast training time
• large memory overhead versus parametric models (the image dataset could be huge!)
• predictions depend entirely on the stored training data

Prediction: find the closest training example to $x$ and give the test example the corresponding label $y_{closest~example}$.

• very slow at test time!
• $O(nd)$ per query, where $d$ is the dimension and $n$ is the number of training examples (see the sketch below)
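A minimal sketch of the classifier (the class name and the choice of L1 distance are mine, not prescribed by these notes):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" just memorizes the data: fast, but memory-heavy.
        self.X_train = X  # shape (n, d)
        self.y_train = y  # shape (n,)

    def predict(self, x):
        # O(nd) per query: compare x against every stored example (L1 distance).
        dists = np.sum(np.abs(self.X_train - x), axis=1)
        return self.y_train[np.argmin(dists)]
```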

# Perceptrons

A perceptron is a single unit that computes its output in two steps.

Given a feature vector $x \in \mathbb{R}^d$ as the input to our perceptron:

1. We take the weighted sum of the inputs $\sum^d_i w_ix_i$
2. We pass it through a non-linear activation function, in this case a step function: $sign(\sum^d_i w_ix_i)$

## Linear Classifier

The perceptron is used to classify between two classes, $-1$ and $1$. It is a linear classifier, as the classification decision is based on the value of a linear combination of the feature vector $x$.

The classes are given by the activation function

$$y = sign(\sum^d_i w_ix_i)$$

We want to set the weights so that the output predicts the class label correctly.
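As a sketch, prediction is a single dot product followed by the step function (the example weights below are arbitrary):

```python
import numpy as np

def perceptron_predict(w, x):
    # Step activation on the weighted sum: returns -1 or +1.
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([0.5, -1.0, 2.0])  # arbitrary example weights
x = np.array([1.0, 0.0, 1.0])
print(perceptron_predict(w, x))  # 0.5 + 2.0 = 2.5 >= 0, so +1
```

(Here the boundary case $w^Tx = 0$ is arbitrarily assigned to class $+1$.)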

## Training Perceptron

Before the perceptron learning algorithm came out, people would randomly change the weights until the output was correct.

Now, we can use the general procedure for ML:

$1$. Model (hypothesis class) $$h(x) \in \mathcal{H}$$

We have to make an assumption (no free lunch theorem); in this case $h(x) = w^Tx$ (we will also write the weights as $\beta$).

$2$. Score Criterion (Cost function): $$S(h) = E_{x, y}L(y, h(x))$$

How are we going to evaluate whether this function is good or bad?

• Define a cost function $\phi(\beta, b)$ as a summation of the distances between all misclassified points and the hyperplane.

$3$. Search strategy

$$\hat{h} = \text{arg min}_{f \in \mathcal{F}}S(f)$$

We have many different hypotheses in our search space; the search strategy uses our score criterion as a guide to choose one.

• minimize the cost function by estimating $\beta, b$: $\min_{\beta, b}~\phi(\beta, b)$, where $\phi$ is the total distance of all misclassified points

## Slight Deviation: Geometry

*Figure: a hyperplane with two points $x_1$, $x_2$ that satisfy the plane equation.*

## Show that $\beta$ is perpendicular to the hyperplane

Points that satisfy the plane equation $\beta^Tx + b = 0$

$$\beta^Tx_1 + b = 0$$ $$\beta^Tx_2 + b = 0$$

$$\beta^Tx_1 + b = \beta^Tx_2 + b$$

$$\beta^Tx_1 = \beta^Tx_2$$

$$\beta^T(x_2 - x_1) = \beta^T\vec{v} = 0$$

The dot product of the weight vector $\beta$ and $\vec{v}$ is zero; since $\vec{v}$ is an arbitrary vector lying in the plane, $\beta$ is $\perp$ to the plane.

For any point $x_0$ on the plane we have

$$\beta^T x_0 + b = 0$$

So we have

$$b = -\beta^T x_0$$

note: from here on we take $\beta$ to be the unit normal vector to the plane, i.e. $||\beta|| = 1$.
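A quick numeric sanity check of this derivation (the plane parameters are arbitrary example values):

```python
import numpy as np

beta = np.array([3.0, 4.0])
beta = beta / np.linalg.norm(beta)  # unit normal, as in the note above
b = -2.0

# Two points x1, x2 on the line beta^T x + b = 0 (solve for the second coordinate)
x1 = np.array([0.0, -b / beta[1]])
x2 = np.array([1.0, (-b - beta[0]) / beta[1]])

v = x2 - x1             # a vector lying in the hyperplane
print(np.dot(beta, v))  # ~0.0: beta is perpendicular to the plane
```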

## Compute the distance of a point to the hyperplane

*Figure: the distance $d$ from a point $x$ to the hyperplane.*

We want to compute the distance of the point $x$ to the hyperplane.

It is easy to compute the distance to an arbitrary point $x_0$ on the plane: we can just take the difference and take its norm.

$$||x-x_0||$$

To get the minimal distance we need to project this difference onto the direction of $\beta$. We do this by taking the dot product with the unit vector $\hat{\beta} = \frac{\beta}{||\beta||}$.

$$d = \hat{\beta}^T(x-x_0) = \hat{\beta}^Tx - \hat{\beta}^Tx_0 = \hat{\beta}^Tx + b$$

where the last step uses $b = -\beta^Tx_0$ and the convention above that $\beta$ is unit length (so $\hat{\beta} = \beta$).

Note that depending on which side of the plane $x$ lies (the same or the opposite direction as $\beta$), the sign of the distance will be different.
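A sketch of the signed distance in code (dividing $b$ by $||\beta||$ keeps the formula valid even when $\beta$ is not unit length):

```python
import numpy as np

def signed_distance(beta, b, x):
    # Signed distance from x to the hyperplane beta^T x + b = 0
    norm = np.linalg.norm(beta)
    return np.dot(beta, x) / norm + b / norm

beta = np.array([0.6, 0.8])  # already unit length
b = -2.0
print(signed_distance(beta, b, np.array([3.0, 4.0])))  # +3.0 (same side as beta)
print(signed_distance(beta, b, np.array([0.0, 0.0])))  # -2.0 (opposite side)
```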

## Back to Cost function

We want a quantity that is positive whenever a point is classified correctly, regardless of its class. We can get this by multiplying the distance by the class label:

$$d_i = y_i (\beta^T x_i + b)$$

This gives us a positive distance when the sign of $\beta^T x_i + b$ agrees with $y_i$, and a negative one when the point is misclassified.

note: we could instead take the absolute value, but it isn't differentiable, and we want our cost function to be differentiable.

We can now write our cost function as

$$\phi(\beta, b) = \sum_{i \in M} -y_i(\beta^T x_i + b)$$

which is the summation over the set $M$ of examples that we have misclassified so far.

note: the negative sign is there to make the distance positive: every example in $M$ is misclassified (it is on the wrong side of the plane), so $y_i(\beta^T x_i + b) < 0$ for each of them.
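In code, the cost is a sum over the misclassified set $M$ (a minimal sketch; `X` is an $(n, d)$ matrix and `y` a vector of $\pm 1$ labels):

```python
import numpy as np

def perceptron_cost(beta, b, X, y):
    # phi(beta, b) = sum over misclassified i of -y_i (beta^T x_i + b)
    scores = X @ beta + b             # beta^T x_i + b for every example
    M = y * scores <= 0               # misclassified (or exactly on the plane)
    return np.sum(-y[M] * scores[M])  # always >= 0
```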

## Online Learning Algorithm

Iterate over the data points $(x_i, y_i)$:

1. Pick a sample point $x_i$
2. Compute $a = \beta^T x_i$
3. If $y_i \cdot a > 0$, do nothing
4. Else update $\beta^{new} \leftarrow \beta^{old} + y_ix_i$

Repeat until the convergence criterion has been met. (The bias $b$ can be absorbed into $\beta$ by appending a constant $1$ to every $x_i$.)
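A sketch of the full loop (absorbing the bias via a constant feature as noted above; `max_epochs` is my own addition to bound the runtime on non-separable data):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    # X: (n, d) feature matrix, y: (n,) labels in {-1, +1}
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 to absorb the bias
    beta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            a = np.dot(beta, x_i)
            if y_i * a <= 0:          # misclassified (or on the plane)
                beta += y_i * x_i     # update rule
                mistakes += 1
        if mistakes == 0:             # convergence: a full pass with no mistakes
            break
    return beta[:-1], beta[-1]        # weights and bias
```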

## Linear Separability

If there is a hyperplane that separates the two classes, then the dataset is said to be linearly separable.

*Figure: a linearly separable dataset and a dataset that is not linearly separable.*

## Properties of Perceptron

• If the data is linearly separable, the solution is theoretically guaranteed to converge to a hyperplane with perfect accuracy in a finite number of steps (the perceptron convergence theorem).
• If there is no separating hyperplane, convergence is of course not guaranteed.
• The perceptron can discriminate only between two classes, $\{-1, 1\}$.
• For linearly separable data there are infinitely many solutions.
• Even though a solution might exist, the number of steps required to reach it can be very large and infeasible.
• In general, the smaller the gap between the classes, the harder it is to reach a solution.