On-Call Hygiene for a Solo Operator

What This Concept Is

On-call at scale is a rotation with coverage, handoff, and sanity rules. On-call for a capstone is usually one person: you. That changes which rules matter, but does not remove the need for a posture.

The solo-operator posture has three non-negotiables:

  • Decide what is actually page-worthy. Not every alert is paging-eligible; most are ticket-eligible.
  • Set hours you are reachable and hours you are not. A capstone does not need 24/7 coverage. It needs published coverage.
  • Have a fallback. Exactly one: a peer, a reviewer, or a documented "service will be up at 9 a.m." policy.

Hygiene is about making this sustainable. A solo operator who runs themselves into the ground is no longer a good operator by week two. The SRE community explicitly documents this as a reliability concern: sustainable on-call is itself a reliability practice, not a wellness one.

The PagerDuty and Google SRE literature both emphasise capping operational load. Google's SRE guidance keeps operational work under 50% of an engineer's time, with no more than 25% spent on-call. For a solo capstone operator the same cap is a survival constraint rather than a productivity one. If the system pages you three times a week, either the alerts are wrong or the system is wrong -- both are fixable before the operator becomes the bottleneck.

Why It Matters Here (In the Capstone)

The PRR wants to know: when this thing breaks, who answers? "Whoever happens to be on their phone" is not an answer. The cost of having no written posture is that you will burn out, or you will treat every alert like a drill, or you will ignore them -- all three end the same way: a real incident goes unanswered.

This concept is marked supporting because it matters less than the SLO, the runbook, or the threat model. But a PRR can still fail on this line if the answer is unconvincing.

Concrete Example -- from a real capstone

One-page on-call policy for the capstone. File at library/raw/on-call.md.

Coverage

  • Primary: you, M-F 09:00-20:00 local. Phone on.
  • Off-hours (nights + weekends): reduced coverage. Fast-burn pages still fire; acknowledge within 60 min.
  • Vacation / holidays: system runs, no guaranteed human. Any incident resolves next business day.
  • Documented on the status page: users know the hours.
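
The coverage windows above can be turned into an acknowledgement deadline. This is a minimal sketch, not the capstone's actual tooling: the function names are hypothetical, the 10-minute figure is taken from the page-handling steps in this policy, and vacation coverage is not modelled.

```python
from datetime import datetime, timedelta

# Values copied from the policy; the names themselves are illustrative.
PRIMARY_START_HOUR = 9    # 09:00 local
PRIMARY_END_HOUR = 20     # 20:00 local
PRIMARY_ACK = timedelta(minutes=10)
OFF_HOURS_ACK = timedelta(minutes=60)


def in_primary_hours(now: datetime) -> bool:
    """True on Monday-Friday between 09:00 and 20:00 local time."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and PRIMARY_START_HOUR <= now.hour < PRIMARY_END_HOUR


def ack_deadline(fired_at: datetime) -> datetime:
    """Latest time by which a page fired at `fired_at` should be acknowledged."""
    window = PRIMARY_ACK if in_primary_hours(fired_at) else OFF_HOURS_ACK
    return fired_at + window
```

A page fired on a Wednesday at 14:03 is due by 14:13; the same page on a Saturday is due by 15:03.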

Page vs ticket

Every alert is one of:

  • Page (phone, SMS, push): fast-burn SLO, security alert with user-visible impact, total outage. Maximum 2 per week target; if exceeded, root-cause the noise, not the operator.
  • Ticket (email / queue only): slow-burn SLO, capacity thresholds, stale-cert warnings at 30 days, unresolved dependency vulns. Reviewed daily during business hours.
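
The page-versus-ticket split above can be expressed as a small routing table. The sketch below is illustrative only: the alert kind strings are made-up labels for the categories in the list, not names from any real alerting system.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # phone, SMS, push
    TICKET = "ticket"  # email / queue only, reviewed daily


# Hypothetical labels for the categories listed in the policy.
PAGE_KINDS = {"fast_burn_slo", "security_user_visible", "total_outage"}
TICKET_KINDS = {
    "slow_burn_slo",
    "capacity_threshold",
    "cert_expiry_30d",
    "dependency_vuln",
}


def route_alert(kind: str) -> Route:
    """Return the route for an alert kind; unclassified kinds are an error."""
    if kind in PAGE_KINDS:
        return Route.PAGE
    if kind in TICKET_KINDS:
        return Route.TICKET
    # Every alert must be one or the other, so refuse to guess.
    raise ValueError(f"unclassified alert kind: {kind!r}")
```

Raising on an unknown kind is a design choice: it forces each new alert to be classified explicitly instead of silently becoming a page.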

Handling a page (solo)

  1. Acknowledge within 10 min during primary hours (within 60 min off-hours, per Coverage). Open the runbook for the fired alert.
  2. Stabilize first, diagnose second. Follow runbook mitigations until impact subsides.
  3. Communicate. If more than 15 min, post a one-line update in the incident channel or status page: "Investigating elevated error rate on /webhook, started 14:03."
  4. Do the minimum now; fix properly during business hours. Mitigate tonight, refactor tomorrow.
  5. Short postmortem within 48h, even for minor incidents. Template below.

Fallback

  • If you cannot respond within 60 min of a page, Dr. X has the pager escalation. Their expectation: safe-mode the system (flip the kill switch), do not attempt repair.
  • Kill switch documented in library/raw/runbooks/safe-mode.md: one command, readable by one other person.
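
The 60-minute escalation rule can be checked mechanically. This is a sketch under the assumption that the pager records when a page fired and when (if ever) it was acknowledged; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ESCALATE_AFTER = timedelta(minutes=60)  # from the fallback rule above


def should_escalate(
    fired_at: datetime, acked_at: Optional[datetime], now: datetime
) -> bool:
    """True when a page is still unacknowledged 60 minutes after firing."""
    if acked_at is not None:
        return False  # the primary responded; no hand-off needed
    return now - fired_at >= ESCALATE_AFTER
```

When this returns True, the fallback person's only job is the safe-mode runbook, not repair.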

Sanity rules

  • No deploys in the last hour of coverage.
  • No risky deploys on Friday afternoon. Ever.
  • After any page, log the incident, even if trivial. "False page: disk warning at 70%" is an incident -- the fix is deleting the alert.
  • Review the alert list monthly. Any alert that has not fired in 180 days is either deleted or explicitly justified.
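
Two of the sanity rules lend themselves to small checks. The sketch below assumes coverage ends at 20:00 (so deploys are blocked from 19:00 onward), treats "Friday afternoon" as Friday from 12:00, and leaves deploys before 09:00 undecided because the policy does not address them. All names are illustrative.

```python
from datetime import date, datetime, timedelta
from typing import Dict, List, Optional

STALE_AFTER = timedelta(days=180)
COVERAGE_END_HOUR = 20  # assumed from the coverage hours in this policy


def stale_alerts(last_fired: Dict[str, Optional[date]], today: date) -> List[str]:
    """Alerts that never fired, or last fired more than 180 days ago."""
    stale = [
        name
        for name, fired in last_fired.items()
        if fired is None or today - fired > STALE_AFTER
    ]
    return sorted(stale)


def deploy_allowed(now: datetime, risky: bool) -> bool:
    """Apply the last-hour and Friday-afternoon deploy rules."""
    if now.hour >= COVERAGE_END_HOUR - 1:
        return False  # last hour of coverage, or coverage already over
    if risky and now.weekday() == 4 and now.hour >= 12:
        return False  # risky deploy on a Friday afternoon
    return True
```

Each name returned by `stale_alerts` needs a decision at the monthly review: delete it or write down why it stays.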

Common Confusion / Misconceptions

"I'll just always be on call; it's my project." Always-on is a guarantee you will eventually break. Published hours are a realistic contract with yourself and with users.

"I need a full rotation like Google." You do not. A rotation of one with explicit hours is a valid rotation. The published hours are what matters, not the count of people.

"Postmortems are for big incidents." Mini-postmortems (4-6 lines) after every incident -- even false pages -- are the highest-leverage practice for a solo operator. They are how the runbooks get better and the alert list shrinks.

"I'll add more alerts to catch things earlier." Solo operators drown in alert volume faster than anyone. The correct move on a noisy alert is usually delete, not add. If you cannot safely delete it, demote it to a ticket or raise its threshold instead.

"Being on-call means no life." If that is true, the posture is wrong. Reduce coverage hours, tune alerts harder, or accept the SLO has to be looser. Burnout is a reliability risk too.

How To Use It (In Your Capstone)

  1. Write library/raw/on-call.md using the example above as a skeleton.
  2. Publish the coverage hours somewhere users can see (status page, README, landing page).
  3. Decide which alerts page and which become tickets. Audit: any alert that paged in the last 30 days but does not meet the page criteria should be demoted to a ticket.
  4. Define the kill switch (safe-mode runbook). One command. Reviewed with your fallback person.
  5. Use the mini-postmortem template for every incident; commit them to library/raw/postmortems/.
  6. Review the alert list monthly: count pages, count false pages, count suppressed tickets. Tune.
  7. Link the posture document from the PRR checklist.
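
The monthly review in step 6 is mostly counting. A minimal sketch follows, assuming incidents are logged with a date, whether they paged, and whether they were false alarms; the `Incident` record and field names are hypothetical, and the weekly limit of 2 comes from the example policy above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

MAX_PAGES_PER_WEEK = 2  # target from the example policy


@dataclass
class Incident:
    day: date
    paged: bool
    false_alarm: bool


def review(incidents: List[Incident]) -> Dict[str, object]:
    """Count pages and false pages, and list ISO weeks over the page target."""
    pages = [i for i in incidents if i.paged]
    false_pages = [i for i in pages if i.false_alarm]
    # Group pages by (ISO year, ISO week number).
    per_week = Counter(tuple(i.day.isocalendar())[:2] for i in pages)
    over = sorted(week for week, n in per_week.items() if n > MAX_PAGES_PER_WEEK)
    return {
        "pages": len(pages),
        "false_pages": len(false_pages),
        "weeks_over_budget": over,
    }
```

Any week listed in `weeks_over_budget` is a prompt to root-cause the noise, in line with the page target in the policy.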

Mini-postmortem template (keep it short)

# Postmortem: <slug>  (YYYY-MM-DD)
- Impact: <1 sentence>
- Duration: <minutes>
- Budget consumed: <% of 30-day>
- What happened: <2-3 sentences>
- What worked: <1-2 bullets>
- What didn't: <1-2 bullets>
- Action items (with owner + due date): <3 max>
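
The "Budget consumed" line needs a small calculation. The sketch below assumes a time-based availability SLO over a 30-day window and counts the incident as a full outage; the 99.9% target in the usage note is a hypothetical value, not one stated in this document.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes


def budget_minutes(slo: float) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * WINDOW_MINUTES


def budget_consumed_pct(downtime_minutes: float, slo: float) -> float:
    """Percentage of the 30-day error budget used by one incident."""
    return round(100.0 * downtime_minutes / budget_minutes(slo), 1)
```

With a hypothetical 99.9% target the budget is about 43.2 minutes, so a 20-minute outage consumes roughly 46% of it. A partial degradation would need to be weighted by the fraction of requests affected.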

Check Yourself

  1. Why is publishing coverage hours part of a valid solo-operator on-call posture?
  2. What single change should you make after a "false page" incident?
  3. Why is a kill switch documented for a peer more valuable than a detailed manual for yourself?
  4. Why does a mini-postmortem after every page (including false ones) matter more for solo operators than for larger rotations?
  5. What is the maximum sustainable page rate per week for a solo capstone operator, and what should you do when it is exceeded?

Mini Drill or Application (Capstone-scoped)

  1. Draft library/raw/on-call.md with your coverage hours, page-vs-ticket rules, and fallback.
  2. List every alert currently paging. For each, mark page / ticket / delete. Commit the reclassification.
  3. Define the kill switch as a single runbook with one command. Test it in staging.
  4. Write the mini-postmortem template into library/raw/postmortems/_template.md.
  5. Review the last 30 days of incidents (even minor ones). Count pages; count false pages; set a target for next month.
  6. Commit the whole bundle and link it from the PRR.

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.