Matrix Multiplication and Factorization Encode Composition

What This Concept Is

Matrix multiplication is not a decorative rule for arranging indices. It is the algebra of composition. If $B$ acts first and $A$ acts second on a vector $v$, then the combined action is $(AB)v = A(Bv)$. That single fact explains why order matters, why inner dimensions must match, why matrices naturally represent chained transformations, and why elimination packages up into a factorization $A = LU$ (or $PA = LU$ when row exchanges are needed).

Multiplication can be read four different ways, and fluency requires all four. Row-times-column: entry $(AB){ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$. Column picture: column $j$ of $AB$ is $A$ times column $j$ of $B$, so $AB$ is "apply $A$ to each column of $B$." Row picture: row $i$ of $AB$ is row $i$ of $A$ times $B$. Outer-product sum: $AB = \sum_k A{:k} B_{k:}$, a sum of rank-one matrices built from column $k$ of $A$ and row $k$ of $B$ -- this is the viewpoint SVD and low-rank approximation require.

Factorization matters because it turns one expensive operation into reusable structure. $LU$ comes from elimination without row exchanges; $PA = LU$ from elimination with partial pivoting; $QR$ comes from orthogonalizing the columns (covered in Cluster 3); $A = S\Lambda S^{-1}$ from eigenvectors; $A = U\Sigma V^T$ from SVD. Every factorization takes the form "write $A$ as a product of structured factors whose action on vectors is easy to execute."

Distinguish $LU$ from other factorizations. $LU$ is cheap ($\tfrac{2}{3}n^3$ flops), always exists up to permutation, and is the workhorse for general dense linear solves. $QR$ is twice as expensive but numerically safer and the right tool when $A$ is rectangular or ill-conditioned. Cholesky is half the cost of $LU$ but requires $A$ symmetric positive definite. $LDL^T$ is the symmetric-indefinite cousin. SVD is universal but the most expensive. Choose by the structure you actually have.

The column picture is also the fastest way to remember the legality rule. If $A$ is $m \times k$ and $B$ is $k \times n$, each column of $B$ lives in $\mathbb{R}^k$; $A$ eats vectors in $\mathbb{R}^k$ and emits vectors in $\mathbb{R}^m$; the output has $n$ columns like $B$ and $m$ rows like $A$. Inner dimensions match or the product is meaningless -- not a convention, a type constraint. Every shape mistake in a NumPy traceback is a violation of this composition constraint.

Why It Matters Here

You will use multiplication and factorization everywhere: composing linear transformations, expressing repeated action as $A^k$, building least-squares systems from $A^TA$, defining similarity transforms $S^{-1}AS$, and understanding $QR$ and SVD as organized products of orthogonal and triangular or diagonal pieces. If multiplication remains a memorized row-by-column ritual, the rest of the module stays shallow.

Factorization is also where asymptotic complexity meets software engineering. Once you see that $LU$ can be amortized across many right-hand sides, you understand why every linear-algebra library separates factor and solve phases; why sparse solvers (SuiteSparse, MUMPS) publish their symbolic and numeric steps separately; and why cache-aware blocked factorizations exist.

The cost model is blunt: forming $A^{-1}$ and doing $A^{-1} b$ is $\tfrac{8}{3} n^3$ flops; $LU$ followed by forward/back-substitution is $\tfrac{2}{3} n^3 + 2 n^2$ flops. The factor-and-solve split is the engineering; everything else downstream assumes you have adopted it.

Concrete Examples

Example 1 -- order matters. Let

$$B = \begin{pmatrix} 1 & 2 \ 0 & 1 \end{pmatrix}, \quad A = \begin{pmatrix} 2 & 0 \ 0 & 3 \end{pmatrix}.$$

$B$ is a horizontal shear; $A$ is an anisotropic scaling. Then

$$AB = \begin{pmatrix} 2 & 4 \ 0 & 3 \end{pmatrix}, \quad BA = \begin{pmatrix} 2 & 6 \ 0 & 3 \end{pmatrix}.$$

They are different because $AB$ means "shear, then scale," whereas $BA$ means "scale, then shear." The unit square maps to different parallelograms depending on the order. That is non-commutativity made visible.

Each view answers a different question cleanly. Need to know if $b$ is reachable as $ABx$? Use the column view. Need to know which inputs are silenced? Use the row view. Need to compress $AB$? Use the outer-product view and truncate small-norm pieces.

Example 2 -- LU captures elimination. For

$$A = \begin{pmatrix} 2 & 1 & 1 \ 4 & 3 & 3 \ 8 & 7 & 9 \end{pmatrix},$$

run elimination and store the multipliers $\ell_{21} = 2$, $\ell_{31} = 4$, $\ell_{32} = 3$ in $L$:

$$L = \begin{pmatrix} 1 & 0 & 0 \ 2 & 1 & 0 \ 4 & 3 & 1 \end{pmatrix}, \quad U = \begin{pmatrix} 2 & 1 & 1 \ 0 & 1 & 1 \ 0 & 0 & 2 \end{pmatrix}.$$

Verify $LU = A$ by block multiplication. To solve $Ax = b$ for any right-hand side, run forward substitution $Ly = b$ ($O(n^2)$), then back substitution $Ux = y$ ($O(n^2)$). The one-time $O(n^3)$ factor is amortized across all right-hand sides.

Example 3 -- outer-product view. The product

$$\begin{pmatrix} 1 \ 2 \ 3 \end{pmatrix}\begin{pmatrix} 4 & 5 \end{pmatrix} = \begin{pmatrix} 4 & 5 \ 8 & 10 \ 12 & 15 \end{pmatrix}$$

is a rank-1 matrix; every column is a scalar multiple of $(1,2,3)^T$. Every matrix can be written as a sum of rank-1 outer products (this is the column picture with the sum reordered), which is exactly the statement that SVD approximates $A$ by truncating such a sum.

Example 4 -- block multiplication. If $A$ and $B$ are partitioned into compatible blocks, the product obeys the same row-by-column rule at the block level:

$$\begin{pmatrix} A_{11} & A_{12} \ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} B_{11} & B_{12} \ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.$$

This is not a convenience -- it is the structural reason cache-blocked GEMM works. Tile the matrices into blocks that fit in L1/L2 cache, multiply blocks via the formula above, and you can reach >90% of hardware peak throughput. Strassen's $O(n^{\log_2 7})$ algorithm is a clever re-parenthesization of the 2×2 block product that trades more additions for fewer multiplications.

Common Confusion / Misconceptions

"$AB$ is close to $BA$ in practice." False. They are equal iff $A$ and $B$ share an eigenbasis. In general $AB \ne BA$, and the commutator $[A,B] = AB - BA$ is itself an important object (in physics, Lie theory, and parallel-algorithm analysis).

"Matrix multiplication is associative and distributive because the definition is." Associativity $(AB)C = A(BC)$ is a theorem, not a definition, and it is the reason the function-composition picture is consistent. Distributivity $A(B+C) = AB + AC$ follows from the same row-by-column structure.

"$LU$ is a trick for homework problems." $LU$ is the foundation of every dense direct solver in LAPACK and the starting point for sparse direct solvers. When you call numpy.linalg.solve, that is $LU$ with partial pivoting executed in compiled code.

"Factorization is only about speed." Factorization is also about structure recovery: $LU$ exposes pivots and rank, $QR$ exposes orthogonal bases of the column space, Cholesky certifies positive definiteness, and SVD exposes singular values. Each factorization answers a different structural question in addition to accelerating solves.

"The row-by-column rule is the only way to think about $AB$." It is the computational rule; it is not the insight. When you need to reason about what $AB$ does, reach for one of the other three views. Column view for "what lives in the range." Row view for "what coordinates survive." Outer-product view for "how much rank can accumulate." Experienced practitioners switch views constantly mid-derivation without noticing.

How To Use It

When multiplying matrices, ask:

What acts first, and what acts second? Write the composition in the correct order.
Do the inner dimensions match? If $A$ is $m\times k$ and $B$ is $k\times n$, then $AB$ is $m\times n$.
Which of the four views (row-by-column, column action, row action, outer-product sum) best answers my question?
Is this composition legal in the spaces involved?
Is there a factorization that makes repeated solves or powers cheaper?

When you see a long chain of operations, stop expanding blindly and look for structure: triangular factors, diagonal factors, orthogonal factors. Those factorizations are where computational advantage comes from.

Finally, remember the re-parenthesization lever: $(ABC)x$ can be evaluated as $A(B(Cx))$ (three matrix-vector products, $O(n^2)$ each) or as $(AB)(Cx)$ or as $(ABC)x$ (two matrix-matrix products then one matrix-vector, $O(n^3)$ total). For a single $x$, always fold inward. For many right-hand sides, compute the product once and reuse. Matrix-chain multiplication (S2) is exactly this decision generalized.

For code: prefer scipy.linalg.lu_factor + lu_solve over repeated np.linalg.solve; prefer scipy.linalg.cho_factor for SPD systems; prefer np.linalg.qr for least-squares with modest conditioning; only reach for SVD when rank questions dominate.

A practical diagnostic. Any expression of the form $M = F_1 F_2 \cdots F_k$ where the $F_i$ are structured (diagonal, triangular, orthogonal) is an opportunity. You can often solve $Mx = b$ by walking the factors in sequence, each costing $O(n^2)$ or better, instead of materializing $M$ and paying $O(n^3)$. The discipline is "see the factorization, then solve through it" -- not "multiply out, then solve."

Transfer / Where This Shows Up Later

S2 (algorithms): Matrix-chain multiplication is a classic dynamic-programming problem precisely because associativity lets you re-parenthesize to minimize flops. Boolean matrix products drive graph-reachability algorithms.
S4 (computer organization): GEMM (general matrix-matrix multiply) is the most optimized kernel in BLAS; SIMD, cache tiling, and GPU tensor cores exist largely to accelerate it.
S5 (databases & queueing): Markov chain composition ($P^k$) is repeated matrix multiplication; transition-matrix powers describe multi-step distributions.
S7 (architecture): Component interaction matrices compose by multiplication; coupling transitivity is captured by boolean-semiring products.
S8 (ranking & recommendation): Collaborative filtering factorizes user-item matrices as $R \approx UV^T$ (a rank-$k$ outer-product sum). Training such models is gradient descent on this factorization.
Phase 7 (ML): A feed-forward network is a chain of $A_\ell \sigma(A_{\ell-1}\sigma(\cdots A_1 x))$; the Jacobian of the network is a product of weight matrices and diagonal activation Jacobians, and backpropagation is the associative re-parenthesization of that product. Mixed-precision matmul (bfloat16/fp16) and tensor cores extend the GEMM story onto specialized hardware.
Across phases: Every time you see "preconditioned" attached to a solver name (PCG, PGMRES), that preconditioner $M^{-1}$ is a factorization choice, not a matrix. Picking it well is the single largest lever for iterative-solver performance.

Check Yourself

Why does "$AB$ means do $B$ first" follow from the definition of matrix-vector multiplication?
What practical advantage does $A = LU$ give over solving from scratch each time?
Why is matrix multiplication dimension-sensitive, and what goes wrong if you multiply a $3\times 2$ by a $3\times 2$?
Give a $2\times 2$ example where $AB = BA$ and explain structurally why.
When is $LU$ not sufficient, and what alternative factorization do you use?
Why does the outer-product view matter for compression?
How does block multiplication let a $1000 \times 1000$ GEMM run near hardware peak?
Why does the transpose identity $(AB)^T = B^T A^T$ follow from "$AB$ means do $B$ first"?

Mini Drill or Application

Compute $AB$ and $BA$ for $A = \begin{pmatrix} 1 & 1 \ 0 & 1 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 1 \ 1 & 0 \end{pmatrix}$. Describe in words what each product does to a vector.
Solve one $3\times 3$ system by elimination and extract the elimination multipliers you would store in $L$.
In NumPy, compute from scipy.linalg import lu_factor, lu_solve; lu, piv = lu_factor(A); x = lu_solve((lu, piv), b). Time this against np.linalg.solve(A, b) inside a loop of 1000 calls with varying $b$; explain the speedup.
Given two rank-1 matrices $u_1 v_1^T$ and $u_2 v_2^T$, show by the outer-product view that their sum has rank at most 2. Find a specific choice where the rank is exactly 1.
Compute $A^2$, $A^3$, $A^4$ for $A = \begin{pmatrix} 1 & 1 \ 0 & 1 \end{pmatrix}$ and spot the pattern. How does this connect to Jordan blocks (preview of Cluster 4)?
Using block multiplication, verify that the $2\times 2$ block form of $I$ times $A = \begin{pmatrix} A_{11} & A_{12} \ A_{21} & A_{22} \end{pmatrix}$ reproduces $A$ block-by-block. Explain why this legality underpins cache-tiled GEMM.
Compute both $AB$ and $BA$ for $A, B \in \mathbb{R}^{3\times 3}$ of your choice. Report the commutator $[A, B] = AB - BA$ and verify $\text{tr}([A, B]) = 0$ always. What does the vanishing trace say about the composition?

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​