Positive Definite Matrices Turn Quadratic Forms into Geometry

What This Concept Is

A symmetric real matrix $A$ is positive definite (PD) iff

$$x^T A x > 0 \quad \text{for every nonzero } x \in \mathbb{R}^n,$$

and positive semidefinite (PSD) iff $x^T A x \ge 0$. Symmetry is part of the definition (the standard theory needs it for the spectral picture to work cleanly); an asymmetric matrix satisfying the inequality is usually turned into $\tfrac{1}{2}(A + A^T)$ first.

Positive definiteness has five equivalent characterizations (any one of them proves all the others):

$x^T A x > 0$ for all nonzero $x$ (the definition).
All eigenvalues of $A$ are strictly positive.
All leading principal minors $\det(A_k)$ (top-left $k \times k$ submatrix determinants) are strictly positive -- Sylvester's criterion.
There exists an invertible $R$ such that $A = R^T R$; the Cholesky decomposition $A = L L^T$ with lower-triangular $L$ having positive diagonal is the canonical one.
All pivots in Gaussian elimination (without row swaps) are positive.

Distinguish five related conditions, in decreasing strength:

Positive definite: $x^T A x > 0$, $\lambda_i > 0$.
Positive semidefinite: $x^T A x \ge 0$, $\lambda_i \ge 0$ (nullspace allowed).
Indefinite: some positive and some negative eigenvalues (saddle behavior).
Negative semidefinite / negative definite: flipped signs.
Symmetric: just $A = A^T$, no sign condition.

The quadratic form $Q(x) = x^T A x$ defines a function from $\mathbb{R}^n$ to $\mathbb{R}$. If $A$ is PD, $Q$ looks like a bowl centered at the origin: level sets ${x : Q(x) = c}$ are ellipsoids whose axis lengths are $\sqrt{c/\lambda_i}$ along the corresponding eigenvector directions. If indefinite, level sets are hyperboloids (saddles). Geometry, algebra, and optimization align cleanly when $A$ is PD.

Why It Matters Here

Positive definiteness explains: why least-squares normal equations $A^TA \hat x = A^T b$ have a unique solution when $A$ has independent columns (because $A^TA$ is PD); why some quadratic objectives have unique minima (PD Hessian); why Cholesky factorization exists and is stable (PD is the precondition); and how curvature enters optimization and ML.

This is the bridge from linear algebra into optimization thinking. Whenever a loss function is quadratic or locally quadratic near a minimum, the Hessian's positive definiteness is the condition for a genuine minimum. Whenever you see a covariance matrix, it is PSD by construction; if the data matrix has independent features, strict PD. Cholesky is the fastest way to solve $Kx = b$ when $K$ is SPD -- half the cost of LU.

Concrete Examples

Example 1 -- diagonal PD. Take

$$A = \begin{pmatrix} 2 & 0 \ 0 & 5 \end{pmatrix}.$$

Then $x^T A x = 2 x_1^2 + 5 x_2^2 > 0$ for all nonzero $x$. Eigenvalues are 2 and 5 (positive). Level sets $2 x_1^2 + 5 x_2^2 = 1$ are ellipses with semi-axes $1/\sqrt{2}$ and $1/\sqrt{5}$ along the standard directions. PD.

Example 2 -- symmetric PD via eigenvalues. For

$$A = \begin{pmatrix} 2 & -1 \ -1 & 2 \end{pmatrix},$$

the characteristic polynomial is $\lambda^2 - 4\lambda + 3 = (\lambda-1)(\lambda-3)$. Eigenvalues $1, 3 > 0$, so PD. Sylvester's criterion: leading minors are $\det(A_1) = 2 > 0$ and $\det(A_2) = \det A = 3 > 0$. Cholesky: $A = L L^T$ with $L = \begin{pmatrix} \sqrt{2} & 0 \ -1/\sqrt{2} & \sqrt{3/2} \end{pmatrix}$ (verify by direct multiplication). All five characterizations agree.

Example 3 -- indefinite saddle. For

$$B = \begin{pmatrix} 1 & 0 \ 0 & -1 \end{pmatrix}, \quad x^T B x = x_1^2 - x_2^2.$$

Positive along the $x_1$ axis, negative along the $x_2$ axis, zero along the diagonal $x_1 = \pm x_2$. Level sets are hyperbolas. Eigenvalues $1, -1$ -- indefinite. Geometrically, this is a saddle, not a bowl. In optimization, a Hessian with this sign pattern at a critical point diagnoses a saddle rather than a min or max.

Example 4 -- PD by construction. For any real matrix $M$ with independent columns, $M^T M$ is SPD. Proof: $x^T M^T M x = |Mx|^2 \ge 0$; it is zero iff $Mx = 0$ iff $x = 0$ (column independence). This is why $A^T A$ in the normal equations is PD whenever the regression design matrix has independent features. Cholesky factorize it to solve the normal equations at half the cost of LU.

Common Confusion / Misconceptions

"Positive diagonal entries alone make a matrix PD." False. Consider $A = \begin{pmatrix} 1 & 2 \ 2 & 1 \end{pmatrix}$: positive diagonal, but $x^T A x = x_1^2 + 4 x_1 x_2 + x_2^2$, which equals $-2$ at $x = (1, -1)^T$. Off-diagonal entries can destroy definiteness.

"Symmetric is the same as positive definite." Symmetry is a requirement of the standard theory but is not sufficient. $B = \begin{pmatrix} 1 & 0 \ 0 & -1 \end{pmatrix}$ is symmetric but not PD.

"PD means invertible." PD is strictly stronger. It implies invertibility (all eigenvalues nonzero), but the reverse is false: $-I$ is invertible but definitely not PD.

"Any matrix of the form $M^TM$ is positive definite." $M^T M$ is always PSD; it is PD iff $M$ has linearly independent columns. If $M$ has dependent columns (e.g., rank-deficient design matrix), $M^T M$ is only PSD and has a nontrivial nullspace -- one of the reasons regularization (ridge regression, adding $\lambda I$) is used.

"Sylvester's criterion uses all principal minors." No -- only the leading principal minors (top-left $k \times k$). For PSD you need to check all principal minors, not just leading ones.

How To Use It

When checking positive definiteness:

Confirm symmetry; if the matrix is not symmetric, work with $\tfrac{1}{2}(A + A^T)$ or step back to a broader theory.
Use one of the five equivalent tests. For small hand work: Sylvester's criterion. For numerical work: attempt Cholesky (scipy.linalg.cholesky) -- it succeeds iff PD.
For semidefinite: compute eigenvalues and check they are all $\ge 0$ (np.linalg.eigvalsh(A).min()).
Interpret the result in optimization language:
- PD: unique local (in fact global, for quadratics) minimum structure.
- PSD: flat or degenerate minimum -- possibly infinitely many minimizers.
- Indefinite: saddle behavior; not a minimum.
Use Cholesky rather than LU whenever $A$ is known SPD; about half the cost.
If $A$ is not PD but you need it to be (covariance regularization), add $\lambda I$ with small $\lambda > 0$ to shift all eigenvalues up.

Transfer / Where This Shows Up Later

S2 (algorithms): Kernel methods require the kernel Gram matrix to be PSD; Mercer's theorem links this to the existence of a feature-space embedding.
S4 (computer organization): Cholesky is the standard SPD solver in LAPACK (dpotrf); it is half the cost of LU and perfectly parallelizable block-wise.
S5 (queueing & probability): Covariance matrices are PSD by construction (Introduction to Probability chapter 7.3 and 7.5 on covariance and multivariate normal). Mahalanobis distance $\sqrt{x^T \Sigma^{-1} x}$ is a metric iff $\Sigma$ is PD.
S7 (architecture): Trust-region methods in multi-stakeholder decisions use PD weight matrices to penalize deviation from a nominal plan.
S8 (ranking): PSD kernels appear in kernel-based recommendation; determinantal point processes use PSD kernels for diversity.
Phase 7 (ML): Gradient descent's local convergence rate near a minimum is governed by the PD Hessian's condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. Newton's method uses $\nabla^2 f(x)^{-1}$ and requires PD Hessian; line-search or trust-region methods enforce a PD-like property. Gaussian log-likelihood contains $-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)$, which is a PD quadratic form. L-BFGS maintains a PD approximation to the inverse Hessian.

Check Yourself

Why is symmetry part of the standard definition of positive definiteness?
What does $x^T A x$ represent geometrically when $A$ is PD?
Why does a positive definite matrix have positive eigenvalues -- and vice versa?
Why is $A^T A$ always PSD, and when is it strictly PD?
What is the optimization meaning of the Hessian being indefinite at a critical point?
What is Sylvester's criterion, and why does it fail for PSD without modification?

Mini Drill or Application

For each matrix, classify it as positive definite, positive semidefinite, indefinite, or negative (semi)definite:

$\begin{pmatrix} 3 & 0 \ 0 & 2 \end{pmatrix}$
$\begin{pmatrix} 1 & 2 \ 2 & 1 \end{pmatrix}$
$\begin{pmatrix} 2 & -1 \ -1 & 2 \end{pmatrix}$
$\begin{pmatrix} 1 & 1 \ 1 & 1 \end{pmatrix}$

Then explain which one would define a strictly stable quadratic loss surface.

In NumPy: generate X = np.random.randn(100, 5) and compute S = X.T @ X; verify S is SPD by attempting np.linalg.cholesky(S) (success = PD). Then generate a rank-deficient $X$ (repeat a column) and observe Cholesky failure.
Derive the Cholesky factor $L$ of $\begin{pmatrix} 4 & 2 \ 2 & 5 \end{pmatrix}$ by hand. Verify $L L^T$ reproduces the matrix.

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​