Say we want to classify an image.

We can compare the images using a notion of similarity.

- L1 distance $$d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$$

- L2 distance $$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$$

Difference between L1 and L2?

- L2 distance penalizes large mistakes much more than L1
- L2 prefers many medium-sized disagreements over one big mistake (i.e. it favors pixels that are uniformly close to the other image)
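As a quick numerical sketch of this difference, the arrays below are toy stand-ins for flattened images: one has a single big pixel disagreement, the other many small ones, with the same L1 distance:

```python
import numpy as np

def l1_distance(i1, i2):
    """Sum of absolute pixel-wise differences."""
    return np.sum(np.abs(i1 - i2))

def l2_distance(i1, i2):
    """Square root of the sum of squared pixel-wise differences."""
    return np.sqrt(np.sum((i1 - i2) ** 2))

a = np.array([0.0, 0.0, 0.0, 0.0])
big = np.array([4.0, 0.0, 0.0, 0.0])     # one big mistake
spread = np.array([1.0, 1.0, 1.0, 1.0])  # uniformly close disagreements

print(l1_distance(a, big), l1_distance(a, spread))  # 4.0 4.0 (L1 can't tell them apart)
print(l2_distance(a, big), l2_distance(a, spread))  # 4.0 2.0 (L2 penalizes the big mistake)
```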

**Training:** to train the NN classifier, simply store all training examples

- fast training time
- large memory overhead versus parametric models (the image dataset could be huge!)
- predictions depend entirely on the stored training data.

**Prediction:** find the closest training example to $x$ and assign the test example the corresponding label $y_{closest~example}$

- very slow at test time!
- $O(nd)$ per query, where $d$ is the dimension and $n$ is the number of training examples.
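A minimal sketch of such a nearest-neighbor classifier (the class and method names are illustrative, not from any particular library):

```python
import numpy as np

class NearestNeighbor:
    """1-NN classifier: training just stores the data; prediction is O(nd)."""

    def train(self, X, y):
        self.X, self.y = X, y  # fast: nothing to fit, but memory grows with n

    def predict(self, x, metric="l1"):
        if metric == "l1":
            dists = np.sum(np.abs(self.X - x), axis=1)     # L1 to every example
        else:
            dists = np.sqrt(np.sum((self.X - x) ** 2, axis=1))  # L2
        return self.y[np.argmin(dists)]  # label of the closest training example

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
nn = NearestNeighbor()
nn.train(X, y)
print(nn.predict(np.array([4.0, 4.5])))  # 1 (closest to [5, 5])
```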

A perceptron is a single unit that has two parts:

Given a feature vector $x \in \mathbb{R}^d$ as the input to our perceptron:

- We take the weighted sum of the inputs $\sum^d_i w_ix_i$
- Pass it through a non-linear activation function, in this case a step function: $sign(\sum^d_i w_ix_i)$

The perceptron is used to classify between two classes, $-1$ or $1$. It is a linear classifier, as the classification decision is based on the value of a **linear combination** of the feature vector $x$.

The classes are given by the activation function

$$y = sign(\sum^d_i w_ix_i)$$
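A minimal sketch of this single unit (here $sign(0)$ is taken as $+1$ by convention, and the function name is illustrative):

```python
import numpy as np

def perceptron_predict(w, x):
    """Single perceptron unit: weighted sum followed by a sign step."""
    a = np.dot(w, x)            # weighted sum of the inputs
    return 1 if a >= 0 else -1  # step activation into classes {-1, 1}

w = np.array([0.5, -1.0, 0.25])
print(perceptron_predict(w, np.array([2.0, 0.5, 1.0])))  # 1
print(perceptron_predict(w, np.array([0.0, 2.0, 1.0])))  # -1
```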

We want to set the weights in a way that the output would predict the label of the class correctly.

Before the perceptron learning rule was introduced, weights were adjusted randomly until the output was correct.

Now, we can use the general procedure for ML:

$1$. Model (hypothesis class) $$h(x) \in \mathcal{H}$$

We have to make an assumption (no free lunch theorem); in this case $h(x) = w^Tx$.

$2$. Score Criterion (Cost function): $$S(h) = \mathbb{E}_{x, y}[L(y, h(x))]$$

How are we going to evaluate whether this function is good or bad?

- Define a cost function $\phi(w, b)$ as the summation of the distances between all misclassified points and the hyperplane.

$3$. Search strategy

$$\hat{h} = \text{arg min}_{h \in \mathcal{H}}S(h)$$

We have many different hypotheses in our search space; the search strategy uses our score criterion as a guide to choose one.

Consider two points $x_1, x_2$ that satisfy the plane equation $\beta^Tx + b = 0$:

$$\beta^Tx_1 + b = 0$$ $$\beta^Tx_2 + b = 0$$

$$\beta^Tx_1 + b = \beta^Tx_2 + b$$

$$\beta^Tx_1 = \beta^Tx_2$$

$$\beta^T(x_2 - x_1) = \beta^T\vec{v} = 0$$

The dot product of the weight vector $\beta$ and $\vec{v}$ (a vector lying in the plane) is zero, hence $\beta$ is $\perp$ to the plane.

For any point $x_0$ on the plane we have

$$\beta^T x_0 + b = 0 $$

So we have

$$b = -\beta^T x_0$$

**note**: the normalized vector $\hat{\beta} = \frac{\beta}{||\beta||}$ is the unit normal vector to the plane.

For an arbitrary point $x$ and a point $x_0$ on the plane, it is easy to compute a distance: we can just take the difference and take the norm.

$$||x-x_0||$$

To get the minimal (perpendicular) distance we need to project this difference onto the direction of $\beta$. We do this by taking the dot product with $\hat{\beta} = \frac{\beta}{||\beta||}$. Using $b = -\beta^Tx_0$,

$$d = \hat{\beta}^T(x-x_0) = \hat{\beta}^Tx - \hat{\beta}^Tx_0 = \frac{\beta^Tx + b}{||\beta||}$$

Note that depending on whether $x$ lies on the side of the plane that $\beta$ points toward or the opposite side, the sign of the distance will be different.
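A small sketch of this signed distance, $(\beta^Tx + b)/||\beta||$ (the function name is illustrative):

```python
import numpy as np

def signed_distance(beta, b, x):
    """Signed perpendicular distance from x to the hyperplane beta^T x + b = 0."""
    return (np.dot(beta, x) + b) / np.linalg.norm(beta)

beta = np.array([3.0, 4.0])  # ||beta|| = 5
b = -5.0
print(signed_distance(beta, b, np.array([3.0, 4.0])))  # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(beta, b, np.array([0.0, 0.0])))  # -5 / 5 = -1.0 (other side)
```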

In the case of misclassification we want our distance to always be positive. We can do this by multiplying by the class label:

$$d_i = y_i (\beta^T x_i + b)$$

This gives a positive distance when the sign of $(\beta^T x_i + b)$ agrees with $y_i$.

**note**: we could also take the absolute value, but it isn't differentiable, and we want our cost function to be differentiable.

We can now write our cost function as

$$\phi(\beta, b) = \sum_{i \in M} -y_i(\beta^T x_i + b)$$

Which is the summation over the set $M$ of all examples that we have misclassified so far.

**note**: the negative sign is there to make the distance positive, since we have misclassified the example (it is on the wrong side of the plane).
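As a sketch, this cost can be evaluated directly (here points with $y_i(\beta^Tx_i + b) \le 0$ are counted as misclassified; the function name is illustrative):

```python
import numpy as np

def perceptron_cost(beta, b, X, y):
    """Sum of -y_i (beta^T x_i + b) over misclassified points (always non-negative)."""
    margins = y * (X @ beta + b)   # y_i (beta^T x_i + b) for every example
    mis = margins <= 0             # misclassified (or exactly on the plane)
    return float(np.sum(-margins[mis]))

beta = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1])               # second point is misclassified
print(perceptron_cost(beta, 0.0, X, y))  # 1.0
```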

- Compute $a = \beta^T x_i$

- if $y_i a > 0$ do nothing

- else update $\beta^{new} \leftarrow \beta^{old} + y_ix_i$

Repeat until the convergence criterion has been met.
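The update loop above can be sketched as follows (a minimal version: the bias $b$ is folded in as an extra weight by appending a constant $1$ feature, and the convergence criterion is a mistake-free pass; names are illustrative):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron learning rule with the bias folded in as an extra weight."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append 1 so b is the last weight
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:  # misclassified (or on the plane)
                w += yi * xi             # update: beta_new <- beta_old + y_i x_i
                mistakes += 1
        if mistakes == 0:                # a full pass with no mistakes: converged
            break
    return w

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(all(yi * np.dot(w, np.append(xi, 1.0)) > 0 for xi, yi in zip(X, y)))  # True
```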

If there is a hyperplane that separates the two classes, then the dataset is said to be linearly separable.

- If the data is linearly separable, the algorithm is theoretically guaranteed to converge to a hyperplane with perfect accuracy in a finite number of steps. See proof here.
- If there is no separating hyperplane, convergence is of course not guaranteed.
- The perceptron can discriminate only between two classes, $\{-1, 1\}$.
- For linearly separable data there are infinitely many solutions.
- Even when a solution exists, the number of steps required to reach it can be very large and infeasible.
- In general, the smaller the gap between the classes, the harder it is to reach a solution.