Projections, Least Squares, and QR Turn Inconsistency into Best Approximation

What This Concept Is

Many real systems are inconsistent: the observation vector $b$ does not lie in the column space of $A$, so $Ax = b$ has no exact solution. Instead of accepting "no solution," the least-squares approach replaces the impossible goal "solve exactly" with the right goal "minimize error in the 2-norm."

Given $Ax = b$ with no exact solution, least squares chooses $\hat x$ to minimize $|Ax - b|^2$. Geometrically, $A\hat x$ is the orthogonal projection of $b$ onto $\text{Col}(A)$. The residual $r = b - A\hat x$ lies in $\text{Col}(A)^\perp = \text{Null}(A^T)$. The orthogonality condition $A^T r = 0$ immediately yields the normal equations

$$A^T A,\hat x = A^T b.$$

When $A$ has independent columns, $A^TA$ is symmetric positive definite, so the normal equations have a unique solution $\hat x = (A^TA)^{-1}A^T b$. The matrix $(A^TA)^{-1}A^T$ is the left pseudoinverse $A^+$; the projection onto $\text{Col}(A)$ is $P = A(A^TA)^{-1}A^T$. Important properties: $P^2 = P$ (idempotent), $P^T = P$ (symmetric), and $Pb = A\hat x$.

QR factorization makes the normal equations numerically cleaner. Write $A = QR$ with $Q$ having orthonormal columns and $R$ upper triangular. Then $A^TA = R^T Q^T Q R = R^T R$ and the normal equations become $R^T R \hat x = R^T Q^T b$, simplifying to $R \hat x = Q^T b$ -- a triangular system. QR produces the answer with condition number $\kappa(A)$ instead of the $\kappa(A)^2$ that plagues explicit $A^TA$ formation. For ill-conditioned $A$, QR (or SVD) is the default; normal equations are the analytical tool.

Distinguish three solution regimes for $Ax = b$:

Square, full rank: $\hat x = A^{-1}b$, exact. Solve by LU.
Overdetermined ($m > n$), full column rank: $\hat x$ is the least-squares solution. Solve by QR.
Underdetermined or rank-deficient: infinitely many least-squares solutions. The minimum-norm one is $\hat x = A^+ b$, computed by SVD.

Why It Matters Here

This is one of the most important pages in the module because it connects orthogonality, approximation, regression, data fitting, and numerical stability. If you want the first serious bridge from textbook linear algebra to ML and data work, this is it. Every linear regression you will ever fit is a least-squares problem; every Kalman filter update is a least-squares update; every principal-component projection is orthogonal projection.

Projections also underlie the geometry of "best explanation" in approximation theory. The orthogonality of the residual means the least-squares estimator extracts everything the columns of $A$ can explain about $b$ and leaves behind a residual that the model provably cannot touch.

Concrete Examples

Example 1 -- projection onto a line. Project $b = (3, 1)^T$ onto the line spanned by $a = (1, 2)^T$. The projection coefficient is $\hat x = \tfrac{a^T b}{a^T a} = \tfrac{3 + 2}{1 + 4} = 1$. The projection is $\hat x \cdot a = (1, 2)^T$. The residual $r = b - \hat x a = (2, -1)^T$. Verify orthogonality: $a^T r = 2 - 2 = 0$. This is least squares in one dimension.

Example 2 -- line of best fit. Fit $y = c + mx$ through the three points $(0, 1), (1, 2), (2, 2)$. Set up

$$A = \begin{pmatrix} 1 & 0 \ 1 & 1 \ 1 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \ 2 \ 2 \end{pmatrix}.$$

Then

$$A^T A = \begin{pmatrix} 3 & 3 \ 3 & 5 \end{pmatrix}, \quad A^T b = \begin{pmatrix} 5 \ 6 \end{pmatrix}.$$

Solve $\begin{pmatrix} 3 & 3 \ 3 & 5 \end{pmatrix}\begin{pmatrix} c \ m \end{pmatrix} = \begin{pmatrix} 5 \ 6 \end{pmatrix}$. Subtract row 1 from row 2: $2m = 1$, so $m = 0.5$, $c = (5 - 3\cdot 0.5)/3 = 7/6$. Best-fit line: $y = 7/6 + 0.5 x$. Residual: $b - A\hat x = (1, 2, 2)^T - (7/6, 5/3, 13/6)^T = (-1/6, 1/3, -1/6)^T$. Check $A^T r = 0$: sum of residuals is $0$, and sum of $x_i r_i$ is $0$. Orthogonality confirmed.

Example 3 -- QR version of the same fit. Apply Gram-Schmidt to columns of $A$: $q_1 = (1, 1, 1)^T / \sqrt{3}$. Project column 2 onto $q_1$: coefficient $(0 + 1 + 2)/\sqrt{3} = 3/\sqrt{3} = \sqrt{3}$. Subtract: $(0, 1, 2)^T - \sqrt{3}\cdot q_1 = (-1, 0, 1)^T$. Normalize: $q_2 = (-1, 0, 1)^T/\sqrt{2}$. So

$$Q = \begin{pmatrix} 1/\sqrt{3} & -1/\sqrt{2} \ 1/\sqrt{3} & 0 \ 1/\sqrt{3} & 1/\sqrt{2} \end{pmatrix}, \quad R = \begin{pmatrix} \sqrt{3} & \sqrt{3} \ 0 & \sqrt{2} \end{pmatrix}.$$

Compute $Q^T b = (5/\sqrt{3}, 1/\sqrt{2})^T$. Back-solve $R\hat x = Q^T b$: from row 2, $m = (1/\sqrt{2})/\sqrt{2} = 1/2$; from row 1, $\sqrt{3} c + \sqrt{3}\cdot(1/2) = 5/\sqrt{3}$, so $c = 7/6$. Same answer, obtained without forming $A^T A$ explicitly.

Common Confusion / Misconceptions

"Least squares fixes inconsistency by secretly making the equations true." It does not. It finds the best approximation $A\hat x$ within $\text{Col}(A)$ under a squared-error criterion. The equations remain inconsistent; only the residual norm is minimized.

"Use normal equations blindly." Forming $A^T A$ squares the condition number. For moderately ill-conditioned $A$ with $\kappa(A) = 10^6$, normal equations have $\kappa(A^TA) = 10^{12}$ and can lose all precision. QR preserves $\kappa(A)$; SVD goes further with singular-value truncation.

"Projection matrices are invertible." A projection $P$ onto a proper subspace satisfies $P^2 = P$ and has eigenvalues $0$ and $1$ only. It is singular (because of the $0$ eigenvalues on the complement). Idempotence replaces invertibility as the defining algebraic property.

"Least squares requires Gaussian noise." No. Least squares minimizes $|Ax - b|_2^2$ as an approximation criterion, independent of probability. Under a Gaussian noise model, it happens to equal the maximum-likelihood estimator, but the geometric projection argument needs no probabilistic assumption.

How To Use It

When faced with too many equations:

Interpret the system as $b$ trying to land inside $\text{Col}(A)$. If $b \in \text{Col}(A)$, solve exactly; otherwise proceed.
Project $b$ onto $\text{Col}(A)$; the image is $A\hat x$.
Derive the optimality condition $A^T r = 0$ from orthogonality of the residual.
Prefer QR (or SVD) to normal equations whenever conditioning is a concern.
Extract the residual and its norm; report both. Residual norm tells you how good the approximation is.
If $A$ is rank-deficient, switch to the minimum-norm solution via SVD.

In code: np.linalg.lstsq(A, b, rcond=None) returns $\hat x$, residual norm, rank, and singular values. It uses SVD under the hood, handling rank-deficient cases gracefully. scipy.linalg.lstsq is equivalent with more options.

Transfer / Where This Shows Up Later

S2 (algorithms): Nearest-neighbor and locality-sensitive-hashing rely on Euclidean projection geometry; convex optimization projects onto feasible sets iteratively.
S4 (computer organization): Fixed-point and floating-point arithmetic design decisions are informed by condition-number arguments from least squares.
S5 (queueing & signal processing): State estimation (Kalman filtering) is recursive least squares; control systems use projection for constraint enforcement.
S7 (architecture): Service-level objectives' "closest feasible" adjustments are projections onto admissible performance surfaces.
S8 (ranking): Learning-to-rank models fit a linear scoring function by least squares on pairwise preferences. Low-rank matrix completion solves nuclear-norm penalized least squares.
Phase 7 (ML): Linear regression is least squares. Ridge regression is $|Ax - b|^2 + \lambda |x|^2$ -- regularized least squares with a closed-form solution. Early neural-network weight initialization uses QR to produce orthogonal weight matrices. Gradient descent on a convex quadratic is steepest descent toward the least-squares solution.

Check Yourself

Why must the residual be orthogonal to the column space at the optimum?
What does $A^T A$ represent geometrically?
Why is QR often preferred over explicitly forming normal equations?
What is the geometric meaning of the projection matrix $P$, and why does $P^2 = P$?
When does least squares have a unique solution, and when does it have infinitely many?
Why is a small residual not the same as a small error in the parameters $x$?

Mini Drill or Application

Project $b = (3, 1)^T$ onto the line spanned by $a = (1, 2)^T$. Verify the residual is orthogonal to $a$.
Set up and solve the least-squares problem for fitting $y = c + mx$ to the points $(0, 1), (1, 2), (2, 2)$. Report $(c, m)$, the residual vector, and the residual norm.
In NumPy: x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None). Use the $A$ and $b$ from the previous drill. Compare x to your hand answer.
Implement Gram-Schmidt on the same $A$ to compute $Q$ and $R$; solve $R\hat x = Q^T b$ by back-substitution; confirm the answer matches.
Write in one sentence why the squared 2-norm, rather than the 1-norm or $\infty$-norm, produces the clean orthogonality geometry used above. (Hint: differentiate the 1-norm at a sign change.)

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​