Testing Beyond the Given Examples

What This Concept Is

"It passes the sample inputs" is the weakest possible evidence of correctness. Serious testing of an algorithm has four layers:

Provided examples. The examples in the problem statement. Necessary, not sufficient.
Edge cases. Empty input, size 1, all-equal elements, maximum allowed size, negative values, duplicates, already-sorted or reverse-sorted, overflow boundaries, inputs that exercise every branch of every if.
Stress testing. Generate random inputs, run your algorithm and a brute-force oracle, compare outputs. Exhaustive for small inputs; randomized for larger. The oracle is the key piece.
Property-based tests. Check structural invariants of the output - is the result sorted? is it a permutation of the input? does it satisfy the constraint? - not just equality with a golden answer.

This is the inverse of unit testing's "pick a few cases." Adversarial property-plus-oracle testing gives high confidence on the infinite family of inputs at the cost of a slightly slower but always-correct oracle.

Why It Matters Here

Algorithmic bugs often live where nobody looks: one-element arrays, ties, overflow at $n = 10^9$, duplicate keys in a "unique" assumption, $k = 0$ in a top-$k$ query, empty matches in a regex. Passing the problem's three visible tests predicts almost nothing about these.

Stress testing with a brute-force oracle is the single highest-ROI technique for catching algorithmic bugs that reach production or a graded submission. For every ADR you write in S7 that says "we chose algorithm X," the underlying fitness function is "did X agree with the oracle on $10^4$ random inputs?"

Testing is also where algorithmic confidence is built: an engineer who consistently runs stress tests before committing does not need to re-derive correctness every time they refactor; the oracle will tell them if something broke.

Concrete Example

Example 1 (generic stress harness).

import random

def stress(my, brute, generator, rounds=500):
    for _ in range(rounds):
        inp = generator()
        got = my(inp)
        want = brute(inp)
        if got != want:
            print("Mismatch on input:", inp)
            print("my:", got, "brute:", want)
            return False
    return True

gen = lambda: [random.randint(-10, 10) for _ in range(random.randint(0, 8))]
assert stress(my_solution, brute_solution, gen)

This pattern has caught more real bugs for more real students than every edge-case checklist combined. The key is to keep the inputs small enough that brute force is instant and varied enough to hit corner cases the examples missed.

Additional hygiene:

Shrink failing inputs. On mismatch, repeatedly remove one element and re-run; keep the smallest input that still fails. (QuickCheck and Hypothesis do this automatically.)
Seed the RNG. Reproducibility matters. Log the seed so a failure can be replayed deterministically.
Escalate size. Once 500 rounds pass at $n \le 8$, run 500 more at $n \le 50$, and so on.

Example 2 (property-based test for sort).

from collections import Counter

def is_valid_sort(original, result):
    return result == sorted(result) and Counter(result) == Counter(original)

for _ in range(1000):
    A = [random.randint(-100, 100) for _ in range(random.randint(0, 30))]
    R = my_sort(A[:])
    assert is_valid_sort(A, R)

No golden values. The properties alone distinguish valid from invalid sorts. This generalizes: a DP shortest-path algorithm can be property-tested by checking the returned distance equals a BFS oracle and the returned path is a valid edge sequence.

Example 3 (Hypothesis / property-based in Python).

from hypothesis import given, strategies as st

@given(st.lists(st.integers(), min_size=0, max_size=50))
def test_sort_sorts(A):
    R = my_sort(A[:])
    assert R == sorted(A)
    assert Counter(R) == Counter(A)

Hypothesis will search a large space of inputs, shrink counterexamples automatically, and cache known failing inputs so regressions are caught immediately.

Common Confusion / Misconceptions

"If all tests pass, the code is right." No - tests prove the presence of correctness on those inputs only. The sample inputs are curated to be illustrative, not adversarial.
"I do not have time for stress testing." Writing a brute force and a small generator is usually less work than debugging a "Wrong Answer" verdict on input you cannot see. Treat stress testing as the faster path, not a luxury.
"A manual edge-case list is as good as a random generator." Manually crafted cases cover the edges you thought of. Random generators find the ones you did not. Both are useful; random plus oracle is strictly stronger.
"Property tests are only for functional programmers." The property-based style works in any language (Hypothesis in Python, jqwik in Java, QuickCheck in Haskell/Scala/JS). In production codebases they catch off-by-one bugs, unicode corner cases, and invariant violations that unit tests miss.
"Brute force is slow; it won't find my bug." The brute-force oracle runs only on tiny inputs (size 8-30) where it is fast. Your algorithm should produce the same output on those inputs; if it doesn't, the bug is reproducible in milliseconds.
"A passing test run means the test is good." Many tests pass on any implementation because they only check surface properties. A test that cannot distinguish your algorithm from the empty function is worthless.

How To Use It

Default checklist for any non-trivial algorithm:

Run the provided examples.
Manually craft edge cases. Empty, size 1, all-equal, monotonic, reverse-monotonic, max size, duplicates, overflow boundary, $k = 0$.
Implement a brute force and a random generator for small sizes.
Run the stress comparison for a few hundred rounds before trusting the code.
Add property checks. Sortedness, sum preservation, permutation, constraint satisfaction.
Log the seed for random generators; treat a failed seed as a test input that must never regress.
On a bug, shrink the failing input to its minimum form before looking at code (concept 16).

In competitive programming, this is the stupid + smart pattern. In production, it is property-based testing (Hypothesis, fast-check, jqwik) plus a slow-but-correct reference implementation in the test suite.

Transfer / Where This Shows Up Later

S1 M5 (problem solving) taught "test what you built." This is the structured version.
S4 (systems) stress-tests lock-free data structures and memory allocators with random concurrent clients and an invariant-based oracle (linearizability checker).
S5 (OS) fuzzes system calls and file system operations against a model implementation.
S6 (databases) runs property-based consistency tests (read your writes, snapshot isolation, linearizability) via oracles like Jepsen.
S7 (architecture) expresses fitness functions (S7 M5 concept 13) as automated tests that run on every PR: the architecture claim is the property; the fitness function is the oracle.
S8 (distributed systems) fuzzes protocol implementations under chaos (network partitions, message reorderings) with TLA+/model-checking as the exhaustive oracle.

Concept 16 picks up when stress testing finds a failing input; concept 18 incorporates "is brute force available for an oracle?" into paradigm triage.

Check Yourself

Why is a brute-force oracle more informative than a large hand-crafted test suite?
Give three edge cases you would test for any sort implementation.
What is a property-based test and how does it differ from a golden-output test?
When stress testing reveals a failing input, what do you do before inspecting the code?
If brute force is too slow to run on the contest's largest size, how do you still use it as an oracle?

Mini Drill or Application

For each scenario, design a stress test and list the edge cases.

Your binary_search(A, x) returns the index of $x$ or $-1$.
Your activity-selection greedy implementation.
Your min_coin_change(C, T) DP.
Your mergesort.
Your $0/1$ knapsack DP.
Implementation drill. Build a stress harness for is_palindrome_subsequence(s, t) (check whether $t$ is a palindromic subsequence of $s$). Write:
- A naive $O(|s|!)$ brute force (enumerate subsequences, check).
- Your efficient implementation (whatever paradigm).
- A generator that emits strings of length 0..10 over alphabet "ab". Run 2000 rounds and log each mismatch. Complexity derivation: harness runs the smart algorithm $O(\text{rounds} \cdot T_\text{smart})$ and brute $O(\text{rounds} \cdot T_\text{brute})$; on $|s| \le 10$ both complete in under a second.

Read This Only If Stuck

Competitive Programming: Tip 5 - Master the art of testing code
Competitive Programming: Do algorithm analysis (sizing tests)
Skiena: War story - Hard against the clock (testing pressure)
Skiena: War story - And then I failed (testing lesson)
Skiena: War story - Skiena for the defense
Competitive Programming: Solutions to chapter 1 exercises
Hypothesis (Python property-based testing) docs (the canonical Python library for property-based testing)
CSES problem set (graded problems where stress-testing against brute force is the expected method)
cs.stackexchange: Why stress test? (highest-voted discussions on oracle-based testing)

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​