Linear Models

By Sajad Darabi

February 2, 2017

Strive for progress. Everyday is a new begining or a new ending.


  • Logistic Regression
  • Linear Regression
  • Convex Functions
  • Maximum Likelihood

Logistic Regression


The main application of logistic regression is in binary classification.

However, there are modifications that can be made to the model to accomodae multiclass labels namely the multinomial logistic regression. Two variations for mutli-class classification are

  • One vs All
  • One vs One Approaches

Here we will discuss the simple binary model.


The standard logistic regression is a linear classifier. You may have seen the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which is what we are trying to fit to a data set and use as a classifier.

Sigmoid Function

Derivative of Sigmoid $\frac{d\sigma(z)}{dz}$

$$\frac{d\sigma(z)}{dz} = (1 + e^{-z})^{-2} \cdot e^{-z}$$

$$ = \frac{1}{(1 + e^{-z})} [1 - \frac{1}{1 + e^{-z}}]$$

$$ = \sigma(z)\cdot (1 - \sigma(z))$$

Properties of Sigmoid

  • Bounded between $[0, 1]$, we can interpret it as a probability

$$P(y = 1~|~x, w) = \sigma(x)$$ $$P(y = 0~|~x, w) = 1 - P(y = 1~|~x, w) = 1- \sigma(x)$$

  • Everywhere differentiable: can use optimization methods like gradient descent

Logistic Unit as a Linear Classifier

We classify a given example as 1 if $P(y = 1~|~x, w) > 0.5$

We can go either way when $P(y = 1~|~x, w) = 0.5 = P(y = -~|~x, w)$

This corresponds to the following condition:

$$\frac{1}{1 + e^{w^Tx + b}} = 0.5$$

In other words we need

$$w^Tx + b = 0$$

This is the same as the equation for a hyperplane! and it is linear in the parameters of our model, namely, $w$ and $b$.

Sigmoid Classifier


Bernoulli interpretation, we want to generate y.

Say we assume it follows a bernoulli model

$$y \sim \text{ bernoulli}(\theta)$$

$$P(Y=y) = \begin{cases} \theta,& \text{if } y = 1\\ 1 - \theta, &\text{if } y = 0 \end{cases}$$

Similarly, we want to generate $y$ given the feature vector $x$.

We assume we can model our probability with a function of the form of $\sigma(w^Tx)$

$$\theta = \sigma(w^Tx) $$

$$P(Y=y ~ | ~ x, w) = \begin{cases} \sigma(w^Tx),& \text{if } y = 1\\ 1 - \sigma(w^Tx), &\text{if } y = 0 \end{cases}$$

How to Find $w$?

Maximum Likelihood Estimation

In MLE is a method of estimating model parameters by maximizing the likelihood function.

The liklihood of observing our dataset given the parameter models is given by

$$ L(\theta; D) = P(D|\theta) = P((x_1, y_1), (x_2, y_2), ... (x_n, y_n)| \theta) = \prod_{k = 1}^np((x_k, y_k)|\theta)$$

Goal: Determine the parametrs $\theta$ of our model.

We can achieve this by finding the $\theta$ that maximizes the likelihood function:

$$\hat{\theta} = argmax_{\theta}~L(\theta)$$

Conditional Likelihood

We need a cost function to guide our parameters to learn an appropiate decision boundary given a dataset

$$D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$$

We want the likelihood function

$$L(y_1, y_2, ..., y_n ~ | x_1, x_2, ..., x_n, w) = P(y_1, y_2, ..., y_n ~ | x_1, x_2, ..., x_n, w)$$

With the assumptions of maximum likelihood, we can write the joint distribution as a product:

$$L(D ~ | ~ w) = \prod p(y_i ~ | ~ x_i)$$

Similar to the short form for bernoulli $P(Y=y) = \theta^y(1-\theta)^{1 - y}$, we can re-write it as

$$L(\theta) = \prod p(y = 1 ~ | ~ x_i)^{y_i}p(y = 0 ~ | ~ x_i)^{1 - y_i}$$

Conditional Likelihood (cont)

$$logL(w, b) = \sum_i log(P(y_i ~ | ~ x_i, w, b))$$

We get the cross entropy loss function

$$ logL(w, b) = \sum_i y_i\text{log}(\sigma(w^Tx_i + b)) + (1 - y_i)\text{log}(1 - \sigma(w^Tx_i + b))$$

We want to maximize the above function, we can turn it into a minimization problem by multiplying by $-1$:

$$ -logL(w, b) = -\sum_i y_i\text{log}(\sigma(w^Tx_i + b)) + (1 - y_i)\text{log}(1 - \sigma(w^Tx_i + b))$$


Suppose $X_1, X_2, ..., X_n$ are i.i.d random variables with density

$$f(x~|~\sigma) = \frac{1}{2\sigma}e^{-\frac{|x|}{\sigma}}$$

Use MLE to find $\sigma$

Convex Functions

A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if it satisfies

$$f(tx_1 + (1 - t)x_2) \leq tf(x_1) + (1 - t)f(x_2)$$

It is also known as Jensen's inequality. It says that for convex functions a secant line lies above the actual graph.

Checking for Convexity

Use Jensen's inequality:

$$f(E[x]) \leq E[f(x)]$$

Second order condition:

$$\mathcal{H} = \nabla^2f(x)_{ij} = \frac{\partial^2f(x)}{\partial x_i \partial x_j}$$

for $i, j = 1, ..., n$ exists and

$$\mathcal{H} \geq 0$$

You can show that the hessian is positive by showing it is positive-semi-definite:

$$v^T\mathcal{H}v \geq 0$$

Useful Matrix Derivatives

  • $\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial x} = y^T A$
  • $\alpha = y^TAx \rightarrow \frac{\partial \alpha}{\partial y} = x^T A^T$
  • $\alpha = x^TAx \rightarrow \frac{\partial \alpha}{\partial x} = x^T(A + A^T)$

Show that the least square objective function is convex

$$f(x) = ||Ax - b||_2^2$$

Show the following function is convex

$$f(x) = e^x + |x|$$

Linear Regression

Use lecture slides.

Thank you!