# Track D: Observability, Reliability & SRE Overview

**Purpose:** Build operational awareness and reliability engineering practices that ensure systems work correctly in production.

**Philosophy:** You cannot manage what you cannot measure. Observability and reliability are core engineering disciplines.

This track begins in Phase 5, when networked systems appear, and continues through all production phases. Every service and system you build should include observability and reliability considerations from the design stage.
## Track Progression by Phase
| Phase | Level | Focus | Skills Developed |
|---|---|---|---|
| Phase 5 | Level 1 | Basic observability | Logging, metrics, traces, basic monitoring for networked systems |
| Phase 6 | Level 1 | Reliability thinking | Failure modes, backpressure, recovery behavior, error budgets |
| Phase 7-8 | Level 2 | SRE fundamentals | SLIs, SLOs, error budgets, alert design, incident response |
| Phase 9 | Level 2 | Cloud-native observability | Managed monitoring services, infrastructure health, distributed tracing |
| Phase 10 | Level 3 | Production readiness | Full observability stack, operational dashboards, postmortem culture |
## Level 1 Starter Guide: Basic Observability

### Getting Started Today (45 minutes)
Add observability to your first networked program:

1. **Structured logging setup:**

```python
import json
import logging
from datetime import datetime, timezone

# Configure logging to emit the raw message (our JSON payload)
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)

def log_structured(level, message, **kwargs):
    """Log structured data for machine processing."""
    log_entry = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'level': level,
        'message': message,
        **kwargs
    }
    # Emit at the requested level so log-level filtering still works
    logger.log(getattr(logging, level, logging.INFO), json.dumps(log_entry))

# Usage in your network service
def handle_request(request):
    log_structured('INFO', 'Request received',
                   method=request.method,
                   path=request.path,
                   client_ip=request.remote_addr)
    try:
        response = process_request(request)
        log_structured('INFO', 'Request completed',
                       status_code=response.status_code,
                       response_time_ms=response.processing_time)
        return response
    except Exception as e:
        log_structured('ERROR', 'Request failed',
                       error_type=type(e).__name__,
                       error_message=str(e))
        raise
```
2. **Basic metrics collection:**

```python
import functools
import time
from collections import Counter, defaultdict

class SimpleMetrics:
    def __init__(self):
        self.counters = Counter()
        self.gauges = {}
        self.histograms = defaultdict(list)

    def _key(self, metric_name, labels):
        return f"{metric_name}_{'_'.join(f'{k}:{v}' for k, v in labels.items())}"

    def increment(self, metric_name, value=1, **labels):
        self.counters[self._key(metric_name, labels)] += value

    def gauge(self, metric_name, value, **labels):
        self.gauges[self._key(metric_name, labels)] = value

    def histogram(self, metric_name, value, **labels):
        self.histograms[self._key(metric_name, labels)].append(value)

    def report(self):
        print("=== Metrics Report ===")
        print("Counters:", dict(self.counters))
        print("Gauges:", self.gauges)
        for key, values in self.histograms.items():
            avg = sum(values) / len(values) if values else 0
            print(f"Histogram {key}: avg={avg:.2f}, count={len(values)}")

# Usage in your service
metrics = SimpleMetrics()

def timed_operation(operation_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = func(*args, **kwargs)
                metrics.increment(f"{operation_name}_total", status='success')
                return result
            except Exception:
                metrics.increment(f"{operation_name}_total", status='error')
                raise
            finally:
                duration = time.time() - start_time
                metrics.histogram(f"{operation_name}_duration_seconds", duration)
        return wrapper
    return decorator

@timed_operation("database_query")
def query_user(user_id):
    # Your database query implementation
    pass
```
3. **Health check endpoint:**

```python
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Basic health check endpoint for monitoring."""
    health_status = {
        'status': 'healthy',
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'version': '1.0.0',
        'dependencies': {}
    }
    # Check critical dependencies
    try:
        # Test database connection
        db_status = check_database_connection()
        health_status['dependencies']['database'] = 'healthy' if db_status else 'unhealthy'
        if not db_status:
            health_status['status'] = 'degraded'
    except Exception:
        health_status['dependencies']['database'] = 'unhealthy'
        health_status['status'] = 'degraded'
    return jsonify(health_status)

@app.route('/metrics')
def metrics_endpoint():
    """Expose metrics for monitoring system scraping."""
    return jsonify({
        'counters': dict(metrics.counters),
        'gauges': metrics.gauges,
        'histograms': {k: len(v) for k, v in metrics.histograms.items()}
    })
```
## Core Observability Concepts (Learn These First)

**The Three Pillars of Observability:**

1. **Logs:** Detailed records of system events and behaviors
   - When to use: Debugging specific issues, audit trails, detailed troubleshooting
   - Best practices: Structured logging, appropriate log levels, correlation IDs

2. **Metrics:** Numerical measurements of system performance and behavior
   - When to use: Monitoring trends, alerting on thresholds, capacity planning
   - Types: Counters (events), gauges (current state), histograms (distributions)

3. **Traces:** Records of requests flowing through distributed systems
   - When to use: Understanding request flow, identifying bottlenecks in complex systems
   - Best practices: Distributed tracing, span relationships, timing analysis
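The correlation-ID best practice mentioned above can be sketched with Python's `contextvars`. This is an illustrative helper (the function names are my own, not from any library): each request gets a fresh ID, and every log line built while handling that request carries it, so the lines can later be grouped.

```python
import contextvars
import json
import uuid

# Correlation ID for the request currently being handled
_correlation_id = contextvars.ContextVar("correlation_id", default="-")

def start_request():
    """Assign a fresh correlation ID at the start of each request."""
    cid = uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def log_line(message, **fields):
    """Build a structured log line carrying the current correlation ID."""
    return json.dumps({
        "correlation_id": _correlation_id.get(),
        "message": message,
        **fields,
    })
```

Grepping a single correlation ID across log files then reconstructs one request's full history, even when many requests interleave.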
**Reliability Concepts:**

1. **Service Level Indicators (SLIs):** Measurements of service performance
   - Examples: Response time, error rate, throughput, availability

2. **Service Level Objectives (SLOs):** Targets for SLI performance
   - Examples: "99% of requests complete within 100ms", "99.9% uptime per month"

3. **Error Budgets:** Acceptable level of failures based on SLO targets
   - Example: 99.9% uptime allows 43.2 minutes of downtime per month
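The 43.2-minute figure follows directly from the SLO arithmetic. A small helper (a sketch for illustration, not part of any standard library) makes the relationship explicit:

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime an availability SLO permits over the window."""
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo_target) * minutes_in_window

# A 99.9% SLO over 30 days permits (1 - 0.999) * 43200 = 43.2 minutes
```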
## Practical Monitoring Techniques by System Type

### Single-Process Applications (Phase 5)

**Application monitoring:**

```python
import os
import time

import psutil

def get_system_metrics():
    """Collect basic system performance metrics."""
    process = psutil.Process(os.getpid())
    return {
        'cpu_percent': process.cpu_percent(),
        'memory_mb': process.memory_info().rss / 1024 / 1024,
        'open_files': process.num_fds() if hasattr(process, 'num_fds') else 0,
        'threads': process.num_threads(),
        'uptime_seconds': time.time() - process.create_time()
    }

def monitor_application():
    """Simple monitoring loop for development."""
    while True:
        metrics = get_system_metrics()
        log_structured('INFO', 'System metrics', **metrics)
        # Simple alerting
        if metrics['memory_mb'] > 512:
            log_structured('WARN', 'High memory usage',
                           memory_mb=metrics['memory_mb'])
        time.sleep(60)  # Check every minute
```
### Network Services (Phase 6-7)

**HTTP service monitoring:**

```python
import functools
import time

from flask import request
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Prometheus-compatible metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
ACTIVE_CONNECTIONS = Gauge('http_active_connections', 'Active HTTP connections')

def monitor_request(func):
    """Decorator to monitor HTTP request metrics."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        ACTIVE_CONNECTIONS.inc()
        try:
            response = func(*args, **kwargs)
            REQUEST_COUNT.labels(method=request.method,
                                 endpoint=request.endpoint,
                                 status=response.status_code).inc()
            return response
        except Exception:
            REQUEST_COUNT.labels(method=request.method,
                                 endpoint=request.endpoint,
                                 status=500).inc()
            raise
        finally:
            REQUEST_DURATION.observe(time.time() - start_time)
            ACTIVE_CONNECTIONS.dec()
    return wrapper

@app.route('/metrics')
def metrics():
    """Expose metrics in Prometheus format."""
    return generate_latest()
```
### Distributed Systems (Phase 8-9)

**Distributed tracing setup:**

```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure distributed tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

def traced_database_operation(query):
    """Database operation with distributed tracing."""
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("query.type", "SELECT")
        span.set_attribute("query.table", "users")
        try:
            result = execute_query(query)
            span.set_attribute("query.rows_returned", len(result))
            span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except Exception as e:
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```
## Building Your Reliability Mindset

### Questions for Every System Component

**Before building:**
- What could fail? (Failure mode analysis)
- How will we know if it's failing? (Monitoring strategy)
- How quickly do we need to detect failures? (Alerting requirements)
- What's the acceptable failure rate? (SLO definition)
- How do we recover when things go wrong? (Incident response planning)

**During operation:**
- Is the system behaving as expected? (Continuous monitoring)
- Are we within our error budget? (SLO tracking)
- What patterns indicate emerging problems? (Trend analysis)
- How do we improve system reliability? (Continuous improvement)
### Incident Response Basics (Phase 8+)

**Simple incident response framework:**

1. **Detection:** How do you know something is wrong?
   - Automated alerting based on SLI thresholds
   - User reports and external monitoring
   - Routine health checks and proactive monitoring

2. **Response:** What do you do when something breaks?
   - Immediate mitigation to restore service
   - Root cause investigation and diagnosis
   - Communication to affected users and stakeholders

3. **Recovery:** How do you prevent it from happening again?
   - Postmortem analysis without blame
   - System improvements and preventive measures
   - Documentation and knowledge sharing

4. **Learning:** How does this improve your systems?
   - Update monitoring and alerting based on incident learnings
   - Improve system design to prevent similar failures
   - Share knowledge across team and organization
### Postmortem Template (Start Using in Phase 6)

```markdown
# Incident Postmortem: [Brief Description]

## Summary
- **Date:** [When incident occurred]
- **Duration:** [How long service was impacted]
- **Impact:** [What users/systems were affected]
- **Root cause:** [What actually caused the problem]

## Timeline
- **[Time]:** [What happened]
- **[Time]:** [Response actions taken]
- **[Time]:** [Service restored]

## What Went Well
- [Positive aspects of response]

## What Could Be Improved
- [Areas for better response]

## Action Items
- [ ] [Specific improvement with owner and deadline]
- [ ] [Process change with implementation plan]

## Lessons Learned
- [Technical insights for future design]
- [Operational insights for future response]
```
## Essential Observability Tools by Phase

### Phase 5: Basic Monitoring
- **Python logging module:** Structured application logging
- **Grafana + InfluxDB:** Time-series metrics visualization
- **Simple alerting:** Email or Slack notifications for critical thresholds

### Phase 6-8: Service Monitoring
- **Prometheus:** Industry-standard metrics collection and alerting
- **ELK Stack (Elasticsearch, Logstash, Kibana):** Log aggregation and analysis
- **Jaeger or Zipkin:** Distributed tracing for service interactions
- **PagerDuty or Opsgenie:** Professional incident management and escalation

### Phase 9-10: Cloud-Native Observability
- **Cloud provider monitoring:** AWS CloudWatch, GCP Monitoring, Azure Monitor
- **Kubernetes monitoring:** Prometheus + Grafana for container orchestration
- **APM tools:** New Relic, Datadog, or AppDynamics for comprehensive application performance monitoring
- **Infrastructure monitoring:** Terraform state monitoring, cloud resource health tracking
## Reliability Engineering Fundamentals

### Defining Service Level Objectives

**Choose meaningful SLIs for your services:**

1. **Request/Response Services:**
   - Availability: Percentage of successful responses (200-299 status codes)
   - Latency: 95th percentile response time under specified load
   - Throughput: Requests per second the service can handle sustainably

2. **Data Processing Services:**
   - Completeness: Percentage of input records processed successfully
   - Freshness: Time between data arrival and processing completion
   - Accuracy: Percentage of outputs that match expected results

3. **Infrastructure Services:**
   - Uptime: Percentage of time service is available and responsive
   - Resource utilization: CPU, memory, disk usage within acceptable ranges
   - Error rates: Percentage of operations that complete without errors
**Example SLO definition:**

```yaml
service: user-authentication-api
slos:
  availability:
    sli: "Percentage of HTTP requests returning 200-299 status codes"
    target: "99.9% over rolling 30-day window"
    error_budget: "0.1% (43.2 minutes per month)"
  latency:
    sli: "95th percentile response time for authentication requests"
    target: "< 200ms under normal load"
    measurement: "Exclude cold starts and scheduled maintenance"
  throughput:
    sli: "Sustained requests per second without degradation"
    target: "> 1000 RPS with < 5% error rate increase"
```
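Given raw request counts and latency samples, the availability and p95-latency SLIs in this definition can be computed directly. This sketch uses the nearest-rank method for percentiles; the function names are illustrative, not from any monitoring library:

```python
import math

def availability(success_count, total_count):
    """Availability SLI: fraction of requests answered successfully."""
    return success_count / total_count if total_count else 1.0

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for the latency SLI."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]
```

In production you would pull these numbers from histogram buckets (e.g. Prometheus `histogram_quantile`) rather than raw samples, but the arithmetic is the same.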
### Error Budget Management

**Error budget calculation:**
- SLO target: 99.9% availability
- Error budget: 0.1% = 43.2 minutes downtime per month
- Burn rate: How quickly you're consuming the error budget

**Decision framework:**
- Budget remaining > 50%: Focus on feature development; acceptable to take risks
- Budget remaining 10-50%: Balance reliability improvements with feature work
- Budget remaining < 10%: Prioritize reliability, defer risky feature releases
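Burn rate can be made concrete in a couple of lines (a sketch, not from the source): a value of 1.0 means the budget lasts exactly the SLO window, and higher values consume it proportionally faster.

```python
def burn_rate(failed, total, slo_target):
    """Ratio of the observed error rate to the rate the SLO allows."""
    observed = failed / total if total else 0.0
    allowed = 1.0 - slo_target
    return observed / allowed

# At burn rate 2.0, a 30-day error budget is exhausted in 15 days
```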
## Practical Monitoring Implementation

### Phase 5: Network Service Monitoring

**Basic Flask service with monitoring:**

```python
import logging
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)
logger = logging.getLogger(__name__)
metrics = SimpleMetrics()  # From previous example

@app.route('/api/users/<user_id>')
@monitor_request  # Decorator from previous example
def get_user(user_id):
    """Get user information with observability."""
    span_start = time.time()
    try:
        # Validate input
        if not user_id.isdigit():
            metrics.increment('user_requests_total', status='invalid_input')
            return jsonify({'error': 'Invalid user ID'}), 400

        # Process request with timing
        with tracer.start_as_current_span("database_lookup"):
            user_data = database.get_user(int(user_id))

        if not user_data:
            metrics.increment('user_requests_total', status='not_found')
            return jsonify({'error': 'User not found'}), 404

        # Success metrics
        processing_time = time.time() - span_start
        metrics.histogram('request_duration_seconds', processing_time)
        metrics.increment('user_requests_total', status='success')
        return jsonify(user_data)
    except Exception as e:
        # Error handling with observability
        processing_time = time.time() - span_start
        metrics.histogram('request_duration_seconds', processing_time)
        metrics.increment('user_requests_total', status='error')
        logger.error("Unexpected error in get_user: %s", e,
                     extra={'user_id': user_id})
        return jsonify({'error': 'Internal server error'}), 500

if __name__ == '__main__':
    # Report metrics periodically from a background thread
    def report_metrics():
        while True:
            time.sleep(60)
            metrics.report()

    metrics_thread = threading.Thread(target=report_metrics, daemon=True)
    metrics_thread.start()
    app.run(host='0.0.0.0', port=5000)
```
### Phase 6-8: Distributed System Observability

**Service mesh observability (conceptual):**
- **Traffic management:** Monitor request routing, load balancing, circuit breaking
- **Security monitoring:** Track authentication, authorization, and encryption status
- **Performance monitoring:** End-to-end request tracing across multiple services
- **Dependency analysis:** Understanding service relationships and failure propagation
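Of the traffic-management behaviors listed above, circuit breaking is easy to sketch in application code. This minimal breaker (illustrative only, not a production implementation) opens after a run of consecutive failures and permits a trial request again once a reset timeout elapses:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then allows a trial request after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None while the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the timeout has passed
        return time.time() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()
```

In a mesh this logic lives in the sidecar proxy (e.g. Envoy outlier detection) rather than application code, but the state machine is the same.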
**Database monitoring integration:**

```python
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

# Monitor database operations via SQLAlchemy engine events
@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    context._query_start_time = time.time()
    metrics.increment('database_queries_total',
                      operation=statement.split()[0].upper())

@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total_time = time.time() - context._query_start_time
    metrics.histogram('database_query_duration_seconds', total_time)
```
### Phase 9-10: Production Monitoring Stack

**Cloud-native monitoring architecture:**
- **Application metrics:** Custom business metrics plus infrastructure metrics
- **Infrastructure monitoring:** Cloud resource utilization, network performance, storage health
- **Security monitoring:** Access patterns, authentication failures, potential security threats
- **Cost monitoring:** Resource usage costs and optimization opportunities
## Building Your SRE Mindset

### Operational Questions for Every Design Decision
- How will we know if this is working correctly in production?
- What metrics indicate this component is healthy vs. unhealthy?
- How do we detect when this component starts to degrade, before it fails completely?
- What's our plan for restoring service when this component fails?
- How do we prevent this type of failure from happening again?

### Balancing Reliability and Innovation

**Reliability vs. feature development trade-offs:**
- **High error budget remaining:** Take calculated risks on new features
- **Low error budget remaining:** Focus on reliability improvements and technical debt
- **Error budget exceeded:** Stop feature development; focus entirely on reliability

**Cultural practices:**
- **Blameless postmortems:** Focus on system improvement rather than individual fault
- **Error budget reviews:** Regular assessment of the reliability vs. innovation balance
- **Reliability targets:** Explicit SLOs that guide engineering priorities
## Resources for Continued Learning

### Books by Phase
- **Phase 5:** *The Art of Monitoring* - Fundamental monitoring concepts and practices
- **Phase 6-7:** *Monitoring and Observability* - Comprehensive guide to observability engineering
- **Phase 8:** *Site Reliability Engineering* - Google's approach to production reliability
- **Phase 9-10:** *The Site Reliability Workbook* - Practical SRE implementation guidance

### Online Resources
- **Prometheus documentation:** Industry-standard metrics and alerting
- **Grafana tutorials:** Visualization and dashboard creation
- **SRE resources:** Google SRE books, blog posts, and case studies
- **Cloud provider monitoring guides:** Platform-specific observability tools

### Practical Exercises
- **Monitor a personal project:** Add comprehensive observability to something you've built
- **Simulate failures:** Practice incident response with controlled system failures
- **Build dashboards:** Create operational dashboards for systems you understand well
- **Write runbooks:** Document operational procedures for services you maintain

### Community and Learning
- **SRE community forums:** Learn from production engineering experiences
- **Conference talks:** SRECon, Monitorama, and cloud provider conferences
- **Open source monitoring:** Contribute to observability tools and share learnings
- **Professional development:** Pursue SRE or DevOps roles that apply these skills professionally
**Remember:** Observability is not just tooling; it's a design philosophy that makes systems understandable and manageable in production. Every system you build should answer the question "How do we know if this is working correctly?" from the beginning of the design process.