Linear Regression: The Foundation
Regression models the relationship between independent variables $\mathbf{x}$ and a dependent variable $y$. Linear regression assumes this relationship is a weighted sum:
$$\hat{y} = w_1 x_1 + \ldots + w_d x_d + b = \mathbf{w}^T \mathbf{x} + b$$
For a full dataset with design matrix $\mathbf{X}$ (one row per example, one column per feature):
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$
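As a minimal sketch of this batched prediction (assuming a NumPy array `X` of shape `(n, d)` and a weight vector `w` of shape `(d,)`; the toy numbers are purely illustrative):

```python
import numpy as np

def predict(X, w, b):
    """Linear model prediction: one row of X per example, one column per feature."""
    # X: (n, d), w: (d,), b: scalar -> (n,) vector of predictions
    return X @ w + b

# Hypothetical toy data
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # n=3 examples, d=2 features
w = np.array([0.5, -0.25])
b = 1.0
y_hat = predict(X, w, b)  # array([1.0, 1.5, 2.0])
```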
The Two Core Assumptions
1. Linearity: $y$ can be expressed as a weighted sum of the features. This is the structural assumption — if the true relationship is non-linear, linear regression will systematically underfit.
2. Gaussian noise: any deviation between $y$ and $\hat{y}$ follows a Gaussian distribution. This assumption justifies the squared error loss — minimizing squared error is equivalent to maximizing the likelihood of observed data under Gaussian noise.
When noise is not Gaussian — for example, heavy-tailed or asymmetric distributions — squared error is no longer the optimal loss. Huber loss, quantile loss, or absolute error may be more appropriate.
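A hedged sketch of how these alternatives compare, written in NumPy; the `delta` threshold for the Huber loss is a hypothetical choice, not a recommended value:

```python
import numpy as np

def squared_error(residual):
    # Optimal under Gaussian noise; heavily penalizes outliers
    return 0.5 * residual ** 2

def absolute_error(residual):
    # Grows linearly, so large residuals carry less influence
    return np.abs(residual)

def huber(residual, delta=1.0):
    # Quadratic near zero, linear in the tails: a compromise that is
    # less sensitive to heavy-tailed noise than squared error.
    small = np.abs(residual) <= delta
    return np.where(small,
                    0.5 * residual ** 2,
                    delta * (np.abs(residual) - 0.5 * delta))
```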
The Loss Function
The squared error for a single example:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}(\hat{y}^{(i)} - y^{(i)})^2$$
Averaged over $n$ training examples:
$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)})^2$$
The factor of $\frac{1}{2}$ is a calculus convenience — the derivative of $\frac{1}{2}u^2$ is $u$, which cancels cleanly. The goal is:
$$\mathbf{w}^*, b^* = \arg\min_{\mathbf{w}, b} L(\mathbf{w}, b)$$
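A minimal NumPy sketch of this averaged loss and its gradients; the symbol names follow the formulas above and are not tied to any particular library:

```python
import numpy as np

def loss(X, y, w, b):
    """Average squared error L(w, b) over the n training examples."""
    residual = X @ w + b - y            # (n,)
    return 0.5 * np.mean(residual ** 2)

def gradients(X, y, w, b):
    """Gradients of L w.r.t. w and b; the 1/2 cancels against the exponent."""
    n = X.shape[0]
    residual = X @ w + b - y
    grad_w = X.T @ residual / n         # (d,)
    grad_b = residual.mean()            # scalar
    return grad_w, grad_b
```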
Analytic Solution
For linear regression, a closed-form solution exists: $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (with the bias absorbed into $\mathbf{w}$ by appending a column of ones to $\mathbf{X}$). In practice, gradient descent is used instead: the analytic solution requires inverting a $d \times d$ matrix, which costs $O(d^3)$ and is numerically unstable when $d$ is large or $\mathbf{X}^T\mathbf{X}$ is near-singular.
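A sketch of the closed form under that bias-absorption assumption; `np.linalg.lstsq` is used rather than an explicit inverse because it is more numerically stable on near-singular design matrices:

```python
import numpy as np

def fit_closed_form(X, y):
    """Solve for w*, b* by least squares rather than an explicit matrix inverse."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])          # absorb the bias as an extra column
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    w, b = theta[:-1], theta[-1]
    return w, b
```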
Logistic Regression: Classification via Linear Models
Logistic regression models the probability that a binary outcome is positive:
$$P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$
The sigmoid function $\sigma$ maps any real number to $(0, 1)$, turning a linear output into a probability.
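A minimal sketch of the predicted probability, with a sigmoid written to stay finite for large $|z|$ (shapes follow the linear-regression sketch above):

```python
import numpy as np

def sigmoid(z):
    """Equivalent to 1 / (1 + exp(-z)), split by sign to avoid overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def predict_proba(X, w, b):
    """P(y = 1 | x) for each row of X."""
    return sigmoid(X @ w + b)
```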
The Loss: Cross-Entropy
Logistic regression minimizes cross-entropy (log loss), not squared error:
$$L = -\frac{1}{n}\sum_{i=1}^n \left[ y^{(i)} \log \hat{p}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \right]$$
This is the negative log-likelihood under a Bernoulli distribution — the correct loss when the noise assumption is binary rather than Gaussian.
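A sketch of this log loss; the small epsilon clip is a common (though not universal) safeguard so the logarithm never sees exactly 0 or 1:

```python
import numpy as np

def cross_entropy(p_hat, y, eps=1e-12):
    """Average negative log-likelihood under a Bernoulli model."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)   # keep log() finite
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```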
The Key Identity
Logistic Regression ≈ Gaussian Naive Bayes
Logistic regression is a discriminative model: it learns $P(y \mid \mathbf{x})$ directly. Gaussian Naive Bayes is a generative model: it learns $P(\mathbf{x} \mid y)$ and $P(y)$, then applies Bayes’ theorem.
When the class-conditional distributions $P(\mathbf{x} \mid y)$ are Gaussian with equal covariance matrices, the posterior $P(y \mid \mathbf{x})$ implied by Naive Bayes takes exactly the logistic form, so both models describe the same family of linear decision boundaries. Logistic regression estimates the boundary directly from data; Naive Bayes estimates the class distributions and derives the boundary implicitly. In practice, logistic regression is preferred because it makes fewer parametric assumptions and is more robust when the Gaussian assumption is violated.
From Linear Models to Neural Networks
A single linear layer is exactly linear regression (with squared error) or logistic regression (with cross-entropy). A neural network is a composition of multiple linear layers with non-linear activation functions between them.
The key insight: every neuron in a neural network is doing linear regression on its inputs. The non-linearity (ReLU, sigmoid, tanh) is what allows the composition to represent non-linear functions. Without activation functions, any depth of linear layers collapses to a single linear transformation — the product of weight matrices is still a matrix.
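A small sketch of that collapse: two stacked linear layers with no activation between them equal one linear layer whose weight is the product of the two matrices (random shapes chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # 4 examples, 3 features
W1 = rng.normal(size=(3, 5))       # layer 1: 3 -> 5
W2 = rng.normal(size=(5, 2))       # layer 2: 5 -> 2

two_layers = (X @ W1) @ W2         # "deep" but purely linear
one_layer = X @ (W1 @ W2)          # a single equivalent linear map
assert np.allclose(two_layers, one_layer)

# With a non-linearity between the layers, the equivalence breaks:
relu = lambda z: np.maximum(z, 0.0)
assert not np.allclose(relu(X @ W1) @ W2, one_layer)
```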
Understanding linear models is not just historical context — it is the foundation for understanding regularization (L1/L2 penalties on $\mathbf{w}$), optimization dynamics (gradient descent on a convex loss), and the role of the bias term in every network layer.