Linear Models

By Sajad Darabi

October 27, 2017

No matter how bad things are, you can always make things worse. ― Randy Pausch


  • Gradient Descent
  • Convex Functions
  • Perceptrons
  • Linear Regression
  • Logistic Regression


Gradient Descent

Gradient descent is a popular optimization algorithm used to find (local) minima of the cost functions in our learning models; its counterpart, gradient ascent, is used to find maxima.
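As a minimal sketch (assuming we are handed a function's gradient `grad_f`, a starting point, and a step size), each iteration simply steps against the gradient, $x \leftarrow x - \alpha \nabla f(x)$:

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_iters=100):
    """Repeatedly step against the gradient to (locally) minimize f.

    grad_f  : callable returning the gradient of f at a point
    x0      : initial guess (array-like)
    lr      : step size (learning rate)
    n_iters : number of update steps
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad_f(x)
    return x

# Toy example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))  # ~[3.]
```

The learning rate and iteration count here are arbitrary illustrative choices; in practice both need tuning.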

Convex Functions

A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if for all $x_1, x_2 \in \mathbb{R}^n$ and $t \in [0, 1]$ it satisfies

$$f(tx_1 + (1 - t)x_2) \leq tf(x_1) + (1 - t)f(x_2)$$

This is a special case of Jensen's inequality. Geometrically, it says that for a convex function the secant line between any two points on the graph lies on or above the graph.

Checking for Convexity

Use Jensen's inequality:

$$f(E[x]) \leq E[f(x)]$$

Second-order condition: the Hessian

$$\mathcal{H}_{ij} = \nabla^2f(x)_{ij} = \frac{\partial^2f(x)}{\partial x_i \partial x_j}$$

for $i, j = 1, ..., n$ exists and

$$\mathcal{H} \succeq 0$$

You can show that the Hessian is positive semi-definite by showing that

$$v^T\mathcal{H}v \geq 0 \quad \text{for all } v$$
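As a quick numerical sketch: for a symmetric Hessian, $v^T\mathcal{H}v \geq 0$ for all $v$ is equivalent to all eigenvalues being non-negative, which gives a simple check (the tolerance below is an arbitrary choice):

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Check whether a symmetric matrix H is positive semi-definite,
    i.e. v^T H v >= 0 for all v, by testing its eigenvalues."""
    return np.all(np.linalg.eigvalsh(H) >= -tol)

# The Hessian of f(x) = ||Ax - b||_2^2 is 2 A^T A, which is always PSD.
A = np.random.randn(5, 3)
print(is_psd(2 * A.T @ A))  # True
```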

Useful Matrix Derivatives

  • $\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial x} = y^T A$

  • $\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial y} = x^T A^T$

  • $\alpha = x^TAx \rightarrow \frac{\partial \alpha}{\partial x} = x^T(A + A^T)$
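These identities can be sanity-checked numerically with finite differences; a throwaway sketch for the quadratic form $\alpha = x^TAx$, using the row-vector derivative convention above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
eps = 1e-6

def alpha(v):
    """Quadratic form alpha(v) = v^T A v."""
    return v @ A @ v

# Central finite differences along each coordinate direction.
num_grad = np.array([(alpha(x + eps * e) - alpha(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print(np.allclose(num_grad, x @ (A + A.T)))  # True: matches x^T (A + A^T)
```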

Show that the least-squares objective function is convex

$$f(x) = ||Ax - b||_2^2$$
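One way to approach this (a sketch of the second-order argument): differentiating the expanded objective twice gives

$$\nabla f(x) = 2A^T(Ax - b), \qquad \mathcal{H} = \nabla^2 f(x) = 2A^TA$$

and for any $v$ we have $v^T(2A^TA)v = 2\|Av\|_2^2 \geq 0$, so the Hessian is positive semi-definite and $f$ is convex.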

Show the following function is convex

$$f(x) = e^x + |x|$$


Perceptron

A perceptron is a single unit that has two parts:

Given a feature vector $x \in \mathbb{R}^d$ as the input to our perceptron:

  1. We take the weighted sum of the inputs $\sum^d_{i=1} w_ix_i$
  2. Pass it to a non-linear activation function; in this case it is a step function, $\text{sign}(\sum^d_{i=1} w_ix_i)$ (a minimal code sketch of these two steps follows below)
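Here is that forward pass as a minimal sketch (the particular weight and feature vectors are arbitrary placeholders):

```python
import numpy as np

def perceptron_output(w, x):
    """Weighted sum of the inputs followed by a sign (step) activation."""
    a = np.dot(w, x)             # step 1: weighted sum sum_i w_i x_i
    return 1 if a >= 0 else -1   # step 2: step / sign activation

w = np.array([0.5, -1.0, 2.0])   # placeholder weights
x = np.array([1.0, 0.2, 0.3])    # placeholder feature vector
print(perceptron_output(w, x))   # weighted sum is 0.9, so output is 1
```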

Linear Classifier

The perceptron is used to classify examples into one of two classes, $-1$ or $1$. It is a linear classifier, as the classification decision is based on the value of a linear combination of the feature vector $x$.

The predicted class is given by the activation function

$$y = \text{sign}\Big(\sum^d_{i=1} w_ix_i\Big)$$

We want to set the weights so that the output predicts the class label correctly.

Training Perceptron

When the perceptron model first came out, practitioners would randomly change the weights until the model worked.

Now, we can use the general procedure for ML:

$1$. Model (hypothesis class) $$h(x) \in \mathcal{H}$$

We have to make an assumption (no free lunch theorem); in this case $h(x) = w^Tx$.

$2$. Score Criterion (Cost function): $$S(h) = E_{x, y}L(y, h(x))$$

How are we going to evaluate whether this function is good or bad?

  • Define a cost function $\phi(\beta, b)$ as the sum of the distances from all misclassified points to the hyperplane.

$3$. Search strategy

$$\hat{h} = \text{arg min}_{h \in \mathcal{H}}S(h)$$

We have many different hypotheses in our search space; the search strategy uses our score criterion as a guide to choose one.

  • Minimize the cost function by estimating $\beta$ and $b$: $\min_{\beta, b}~\phi(\beta, b)$, where $\phi(\beta, b)$ is the sum of the distances of all misclassified points to the hyperplane.

Slight Deviation: Geometry

(Figure: a hyperplane with two points $x_1, x_2$ that satisfy the plane equation.)

Show that $\beta$ is perpendicular to the hyperplane

Points that satisfy the plane equation $\beta^Tx + b = 0$

$$\beta^Tx_1 + b = 0$$ $$\beta^Tx_2 + b = 0$$

Hence we have

$$\beta^Tx_1 + b = \beta^Tx_2 + b$$

Subtracting the two equations, we get a dot product that is equal to zero:

$$\beta^T(x_2 - x_1) = \beta^T\vec{v} = 0$$

Since $\vec{v} = x_2 - x_1$ is an arbitrary vector lying in the plane, $\beta$ is $\perp$ to the plane.

For any point $x_0$ on the plane we have

$$\beta^T x_0 + b = 0 $$

So we have

$$b = -\beta^T x_0$$

Note: here we assume $\beta$ is the unit normal vector to the plane, i.e. $\|\beta\| = 1$.

Compute the distance of a point to the hyperplane

(Figure: the distance $d$ from a point $x$ to the hyperplane.)

We want to compute the distance of a point $x$ to the hyperplane.

It is easy to compute the distance to an arbitrary point $x_0$ on the plane: we can just take the difference $x - x_0$ and take its norm.


To get the minimal (perpendicular) distance we project this difference onto the direction of $\beta$. We do this by taking the dot product with $\hat{\beta} = \frac{\beta}{\|\beta\|}$

$$d = \hat{\beta}^T(x-x_0) = \hat{\beta}^Tx - \hat{\beta}^Tx_0 = \beta^Tx + b$$

where the last equality uses $\|\beta\| = 1$ and $b = -\beta^Tx_0$.

Note that the sign of $d$ depends on which side of the hyperplane $x$ lies, i.e. whether $x - x_0$ points in the same or the opposite direction as $\beta$. We can take the absolute value if we only care about the magnitude.
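A small numerical sketch of this computation; the division by $\|\beta\|$ makes the formula valid even when $\beta$ is not unit norm, and the hyperplane below is an arbitrary example:

```python
import numpy as np

def signed_distance(beta, b, x):
    """Signed distance from point x to the hyperplane beta^T x + b = 0."""
    return (np.dot(beta, x) + b) / np.linalg.norm(beta)

beta = np.array([3.0, 4.0])         # normal vector with norm 5
b = -5.0
x = np.array([3.0, 4.0])
print(signed_distance(beta, b, x))  # (25 - 5) / 5 = 4.0
```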

Back to Cost function

In the case of misclassification we want our distance to always be positive. We can keep track of the sign by multiplying by the class label.

$$d_i = y_i (\beta^T x_i + b)$$

This gives a positive distance when the sign of $(\beta^T x_i + b)$ agrees with $y_i$, and a negative one when it does not.

note: we can also take the absolute value, but it isn't differentiable. We want our cost function to be differentiable.

We can now write our cost function as

$$\phi(\beta, b) = \sum_{i \in M} -y_i(\beta^T x_i + b)$$

where $M$ is the set of examples that we have misclassified so far.

Note: the negative sign is there to make each term positive, since each example in $M$ has been misclassified (it is on the wrong side of the plane).
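As a sketch, this cost can be computed directly from the formula above (labels are assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def perceptron_loss(beta, b, X, y):
    """Sum of -y_i (beta^T x_i + b) over the misclassified points.
    X is (n, d); y holds labels in {-1, +1}."""
    margins = y * (X @ beta + b)     # positive iff classified correctly
    M = margins <= 0                 # points on the wrong side (or boundary)
    return -np.sum(margins[M])
```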

Gradient Descent

Take the partials with respect to $\beta$ and $b$ to find the update rules:

$$\frac{\partial{\phi}}{\partial{\beta}} = \sum_{i \in M}-y_ix_i$$

$$\frac{\partial{\phi}}{\partial{b}} = \sum_{i \in M}-y_i$$

The updates are then:

$$\beta^{new} = \beta^{old} - \alpha \frac{\partial{\phi}}{\partial{\beta}} $$

$$b^{new} = b^{old} - \alpha \frac{\partial{\phi}}{\partial{b}} $$
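A hedged sketch of these batch updates (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def perceptron_batch_gd(X, y, lr=0.1, n_iters=100):
    """Batch gradient descent on the perceptron cost phi(beta, b).
    X is (n, d); labels y are in {-1, +1}."""
    n, d = X.shape
    beta, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ beta + b)
        M = margins <= 0                  # currently misclassified points
        if not np.any(M):                 # nothing misclassified: stop early
            break
        grad_beta = -(X[M].T @ y[M])      # d(phi)/d(beta) = -sum y_i x_i
        grad_b = -np.sum(y[M])            # d(phi)/d(b)    = -sum y_i
        beta -= lr * grad_beta
        b -= lr * grad_b
    return beta, b
```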

Online Learning Algorithm

Iterate over data points $(x_i, y_i)$

  1. Pick a sample point $x_i$

  2. Compute $a = \beta^T x_i$

  3. if $y_i \, a > 0$ (correctly classified), do nothing

  4. else update $\beta^{new} \leftarrow \beta^{old} + y_ix_i$

Repeat until the convergence criterion has been met.
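A sketch of this online update rule, assuming the bias $b$ has been absorbed into $\beta$ by appending a constant $1$ feature to each $x_i$:

```python
import numpy as np

def perceptron_online(X, y, n_epochs=10):
    """Online perceptron updates; labels y are in {-1, +1}.
    Assumes a constant-1 column has been appended to X so the bias
    is part of beta."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_epochs):
        made_update = False
        for x_i, y_i in zip(X, y):
            a = beta @ x_i
            if y_i * a <= 0:          # misclassified (or on the boundary)
                beta += y_i * x_i     # beta_new <- beta_old + y_i x_i
                made_update = True
        if not made_update:           # a full pass with no mistakes: stop
            break
    return beta
```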

Linear Separability

If there is a hyperplane that separates the two classes, then the dataset is said to be linearly separable.

(Figures: a linearly separable dataset, and one that is not linearly separable.)

Properties of Perceptron

  • If the data is linearly separable, the algorithm is theoretically guaranteed to converge to a hyperplane with perfect accuracy in a finite number of steps. See proof here.
  • If there is no separating hyperplane, convergence is of course not guaranteed.
  • The perceptron can discriminate only between two classes $\{-1, 1\}$.
  • For linearly separable data there are infinitely many solutions.
  • Even when a solution exists, the number of steps required to reach it can be very large and infeasible.
  • In general, the smaller the gap between the classes, the harder it is to reach a solution.

Logistic Regression


The main application of logistic regression is binary classification. However, modifications can be made to the model to accommodate multiclass labels, namely multinomial logistic regression. Two variations for multi-class classification are

  • One vs All
  • One vs One Approaches

Here we will discuss the simple binary model.


The standard logistic regression is a linear classifier. You may have seen the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which is what we are trying to fit to a data set and use as a classifier.

Sigmoid Function

Derivative of Sigmoid $\frac{d\sigma(z)}{dz}$

$$\frac{d\sigma(z)}{dz} = (1 + e^{-z})^{-2} \cdot e^{-z}$$

$$ = \frac{1}{(1 + e^{-z})} [1 - \frac{1}{1 + e^{-z}}]$$

$$ = \sigma(z)\cdot (1 - \sigma(z))$$
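A quick sketch of the sigmoid and its derivative (illustrative only; for very large $|z|$ a numerically stable formulation would be preferable):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # sigma(z) * (1 - sigma(z))

z = np.linspace(-5, 5, 5)
print(sigmoid(z))             # values in (0, 1)
print(sigmoid_grad(z))        # peaks at 0.25 when z = 0
```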

Properties of Sigmoid

  • Its output is bounded in $[0, 1]$, so we can interpret it as a probability

$$P(y = 1~|~x, w) = \sigma(w^Tx)$$ $$P(y = 0~|~x, w) = 1 - P(y = 1~|~x, w) = 1- \sigma(w^Tx)$$

  • Everywhere differentiable: can use optimization methods like gradient descent

Logistic as a Linear Classifier

We classify a given example as $1$ if $P(y = 1~|~x, w) > 0.5$.

We can go either way when $P(y = 1~|~x, w) = 0.5 = P(y = 0~|~x, w)$.

This corresponds to the following condition:

$$\sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}} = 0.5$$

In other words we need

$$w^Tx + b = 0$$

This is the same as the equation for a hyperplane! And it is linear in the parameters of our model, namely $w$ and $b$.
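A minimal sketch of this decision rule (the particular $w$, $b$, and input below are placeholders for illustration):

```python
import numpy as np

def predict(w, b, x):
    """Classify as 1 when P(y=1|x) = sigma(w^T x + b) > 0.5,
    which is equivalent to checking w^T x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([1.0, -2.0])   # placeholder weights
b = 0.5                     # placeholder bias
print(predict(w, b, np.array([2.0, 0.5])))  # w^T x + b = 1.5 > 0, so 1
```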


Bernoulli interpretation: we want a model that generates $y$.

Say we assume $y$ follows a Bernoulli model

$$y \sim \text{Bernoulli}(\theta)$$

$$P(Y=y) = \begin{cases} \theta,& \text{if } y = 1\\ 1 - \theta, &\text{if } y = 0 \end{cases}$$

Similarly, we want to generate $y$ given the feature vector $x$.

We assume we can model this probability with a function of the form $\sigma(w^Tx)$

$$\theta = \sigma(w^Tx) $$

$$P(Y=y ~ | ~ x, w) = \begin{cases} \sigma(w^Tx),& \text{if } y = 1\\ 1 - \sigma(w^Tx), &\text{if } y = 0 \end{cases}$$

Conditional Likelihood

We need a cost function to guide our parameters towards an appropriate decision boundary given a dataset

$$D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$$

We want the likelihood function

$$L(y_1, y_2, ..., y_n ~ | x_1, x_2, ..., x_n, w) = P(y_1, y_2, ..., y_n ~ | x_1, x_2, ..., x_n, w)$$

With the i.i.d. assumption of maximum likelihood, we can write the joint distribution as a product:

$$L(D ~ | ~ w) = \prod p(y_i ~ | ~ x_i)$$

Similar to the short form for the Bernoulli distribution, $P(Y=y) = \theta^y(1-\theta)^{1 - y}$, we can rewrite it as

$$L(w) = \prod p(y = 1 ~ | ~ x_i)^{y_i}\,p(y = 0 ~ | ~ x_i)^{1 - y_i}$$

Conditional Likelihood (cont)

$$\log L(w) = \sum_i \log P(y_i ~ | ~ x_i, w)$$

Expanding the terms gives

$$\log L(w) = \sum_i y_i\log(\sigma(w^Tx_i)) + (1 - y_i)\log(1 - \sigma(w^Tx_i))$$

We want to maximize this function; we can turn it into a minimization problem by multiplying by $-1$, which gives the cross-entropy loss:

$$-\log L(w) = -\sum_i \left[y_i\log(\sigma(w^Tx_i)) + (1 - y_i)\log(1 - \sigma(w^Tx_i))\right]$$
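A sketch of the resulting negative log-likelihood as a function of the parameters (the small constant `eps` is an added guard against $\log(0)$, not part of the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y, eps=1e-12):
    """Negative log-likelihood for labels y in {0, 1}; X is (n, d)."""
    p = sigmoid(X @ w + b)    # P(y = 1 | x_i, w) for every example
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```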

Gradient Descent

We can use gradient descent to iteratively move our parameters in the direction that would minimize the cross-entropy loss

$$w_i^{(t + 1)} \leftarrow w_i^{(t)} - \lambda \frac{\partial l(w, b)}{\partial w_i}$$

We can also write this in vector form:

$$\nabla_w l(w, b) = \frac{\partial l(w, b)}{\partial w} = \left[\frac{\partial l(w, b)}{\partial w_1}, \frac{\partial l(w, b)}{\partial w_2}, ..., \frac{\partial l(w, b)}{\partial w_d}\right]$$

$$w^{(t + 1)} \leftarrow w^{(t)} - \lambda \nabla_w l(w, b)$$

$$b^{(t + 1)} \leftarrow b^{(t)} - \lambda \nabla_b l(w, b)$$

Compute the $\nabla$ of the cost function for both $w$ and $b$

Compute the hessian $\mathcal{H}$ of the cost function for both $w$ and $b$
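Putting the pieces together, a sketch of a full training loop; it uses the standard gradients $\nabla_w(-\log L) = \sum_i (\sigma(w^Tx_i + b) - y_i)\,x_i$ and $\nabla_b(-\log L) = \sum_i (\sigma(w^Tx_i + b) - y_i)$, which also answer the first exercise above (learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss; labels y in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # predicted P(y = 1 | x_i)
        grad_w = X.T @ (p - y)        # sum_i (sigma - y_i) x_i
        grad_b = np.sum(p - y)        # sum_i (sigma - y_i)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```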

Linear Regression

Lecture slides