Testing Beyond the Given Examples
What This Concept Is
"It passes the sample inputs" is the weakest possible evidence of correctness. Serious testing of an algorithm has four layers:
- Provided examples. The examples in the problem statement. Necessary, not sufficient.
- Edge cases. Empty input, size 1, all-equal elements, maximum allowed size, negative values, duplicates, already-sorted or reverse-sorted, overflow boundaries, inputs that exercise every branch of every
if. - Stress testing. Generate random inputs, run your algorithm and a brute-force oracle, compare outputs. Exhaustive for small inputs; randomized for larger. The oracle is the key piece.
- Property-based tests. Check structural invariants of the output - is the result sorted? is it a permutation of the input? does it satisfy the constraint? - not just equality with a golden answer.
This is the inverse of unit testing's "pick a few cases." Adversarial property-plus-oracle testing gives high confidence on the infinite family of inputs at the cost of a slightly slower but always-correct oracle.
Why It Matters Here
Algorithmic bugs often live where nobody looks: one-element arrays, ties, overflow at $n = 10^9$, duplicate keys in a "unique" assumption, $k = 0$ in a top-$k$ query, empty matches in a regex. Passing the problem's three visible tests predicts almost nothing about these.
Stress testing with a brute-force oracle is the single highest-ROI technique for catching algorithmic bugs that reach production or a graded submission. For every ADR you write in S7 that says "we chose algorithm X," the underlying fitness function is "did X agree with the oracle on $10^4$ random inputs?"
Testing is also where algorithmic confidence is built: an engineer who consistently runs stress tests before committing does not need to re-derive correctness every time they refactor; the oracle will tell them if something broke.
Concrete Example
Example 1 (generic stress harness).
import random
def stress(my, brute, generator, rounds=500):
for _ in range(rounds):
inp = generator()
got = my(inp)
want = brute(inp)
if got != want:
print("Mismatch on input:", inp)
print("my:", got, "brute:", want)
return False
return True
gen = lambda: [random.randint(-10, 10) for _ in range(random.randint(0, 8))]
assert stress(my_solution, brute_solution, gen)
This pattern has caught more real bugs for more real students than every edge-case checklist combined. The key is to keep the inputs small enough that brute force is instant and varied enough to hit corner cases the examples missed.
Additional hygiene:
- Shrink failing inputs. On mismatch, repeatedly remove one element and re-run; keep the smallest input that still fails. (QuickCheck and Hypothesis do this automatically.)
- Seed the RNG. Reproducibility matters. Log the seed so a failure can be replayed deterministically.
- Escalate size. Once 500 rounds pass at $n \le 8$, run 500 more at $n \le 50$, and so on.
Example 2 (property-based test for sort).
from collections import Counter
def is_valid_sort(original, result):
return result == sorted(result) and Counter(result) == Counter(original)
for _ in range(1000):
A = [random.randint(-100, 100) for _ in range(random.randint(0, 30))]
R = my_sort(A[:])
assert is_valid_sort(A, R)
No golden values. The properties alone distinguish valid from invalid sorts. This generalizes: a DP shortest-path algorithm can be property-tested by checking the returned distance equals a BFS oracle and the returned path is a valid edge sequence.
Example 3 (Hypothesis / property-based in Python).
from hypothesis import given, strategies as st
@given(st.lists(st.integers(), min_size=0, max_size=50))
def test_sort_sorts(A):
R = my_sort(A[:])
assert R == sorted(A)
assert Counter(R) == Counter(A)
Hypothesis will search a large space of inputs, shrink counterexamples automatically, and cache known failing inputs so regressions are caught immediately.
Common Confusion / Misconceptions
- "If all tests pass, the code is right." No - tests prove the presence of correctness on those inputs only. The sample inputs are curated to be illustrative, not adversarial.
- "I do not have time for stress testing." Writing a brute force and a small generator is usually less work than debugging a "Wrong Answer" verdict on input you cannot see. Treat stress testing as the faster path, not a luxury.
- "A manual edge-case list is as good as a random generator." Manually crafted cases cover the edges you thought of. Random generators find the ones you did not. Both are useful; random plus oracle is strictly stronger.
- "Property tests are only for functional programmers." The property-based style works in any language (Hypothesis in Python, jqwik in Java, QuickCheck in Haskell/Scala/JS). In production codebases they catch off-by-one bugs, unicode corner cases, and invariant violations that unit tests miss.
- "Brute force is slow; it won't find my bug." The brute-force oracle runs only on tiny inputs (size 8-30) where it is fast. Your algorithm should produce the same output on those inputs; if it doesn't, the bug is reproducible in milliseconds.
- "A passing test run means the test is good." Many tests pass on any implementation because they only check surface properties. A test that cannot distinguish your algorithm from the empty function is worthless.
How To Use It
Default checklist for any non-trivial algorithm:
- Run the provided examples.
- Manually craft edge cases. Empty, size 1, all-equal, monotonic, reverse-monotonic, max size, duplicates, overflow boundary, $k = 0$.
- Implement a brute force and a random generator for small sizes.
- Run the stress comparison for a few hundred rounds before trusting the code.
- Add property checks. Sortedness, sum preservation, permutation, constraint satisfaction.
- Log the seed for random generators; treat a failed seed as a test input that must never regress.
- On a bug, shrink the failing input to its minimum form before looking at code (concept 16).
In competitive programming, this is the stupid + smart pattern. In production, it is property-based testing (Hypothesis, fast-check, jqwik) plus a slow-but-correct reference implementation in the test suite.
Transfer / Where This Shows Up Later
- S1 M5 (problem solving) taught "test what you built." This is the structured version.
- S4 (systems) stress-tests lock-free data structures and memory allocators with random concurrent clients and an invariant-based oracle (linearizability checker).
- S5 (OS) fuzzes system calls and file system operations against a model implementation.
- S6 (databases) runs property-based consistency tests (read your writes, snapshot isolation, linearizability) via oracles like Jepsen.
- S7 (architecture) expresses fitness functions (S7 M5 concept 13) as automated tests that run on every PR: the architecture claim is the property; the fitness function is the oracle.
- S8 (distributed systems) fuzzes protocol implementations under chaos (network partitions, message reorderings) with TLA+/model-checking as the exhaustive oracle.
Concept 16 picks up when stress testing finds a failing input; concept 18 incorporates "is brute force available for an oracle?" into paradigm triage.
Check Yourself
- Why is a brute-force oracle more informative than a large hand-crafted test suite?
- Give three edge cases you would test for any sort implementation.
- What is a property-based test and how does it differ from a golden-output test?
- When stress testing reveals a failing input, what do you do before inspecting the code?
- If brute force is too slow to run on the contest's largest size, how do you still use it as an oracle?
Mini Drill or Application
For each scenario, design a stress test and list the edge cases.
- Your
binary_search(A, x)returns the index of $x$ or $-1$. - Your activity-selection greedy implementation.
- Your
min_coin_change(C, T)DP. - Your mergesort.
- Your $0/1$ knapsack DP.
- Implementation drill. Build a stress harness for
is_palindrome_subsequence(s, t)(check whether $t$ is a palindromic subsequence of $s$). Write:- A naive $O(|s|!)$ brute force (enumerate subsequences, check).
- Your efficient implementation (whatever paradigm).
- A generator that emits strings of length 0..10 over alphabet
"ab". Run 2000 rounds and log each mismatch. Complexity derivation: harness runs the smart algorithm $O(\text{rounds} \cdot T_\text{smart})$ and brute $O(\text{rounds} \cdot T_\text{brute})$; on $|s| \le 10$ both complete in under a second.
Read This Only If Stuck
- Competitive Programming: Tip 5 - Master the art of testing code
- Competitive Programming: Do algorithm analysis (sizing tests)
- Skiena: War story - Hard against the clock (testing pressure)
- Skiena: War story - And then I failed (testing lesson)
- Skiena: War story - Skiena for the defense
- Competitive Programming: Solutions to chapter 1 exercises
- Hypothesis (Python property-based testing) docs (the canonical Python library for property-based testing)
- CSES problem set (graded problems where stress-testing against brute force is the expected method)
- cs.stackexchange: Why stress test? (highest-voted discussions on oracle-based testing)