# Track D: Observability, Reliability & SRE Overview

**Purpose:** Build operational awareness and reliability engineering practices that ensure systems work correctly in production.

**Philosophy:** You cannot manage what you cannot measure. Observability and reliability are core engineering disciplines.

This track begins in Phase 5, when networked systems appear, and continues through all production phases. Every service and system you build should include observability and reliability considerations from the design stage.
## Track Progression by Phase
| Phase | Level | Focus | Skills Developed |
|---|---|---|---|
| Phase 5 | Level 1 | Basic observability | Logging, metrics, traces, basic monitoring for networked systems |
| Phase 6 | Level 1 | Reliability thinking | Failure modes, backpressure, recovery behavior, error budgets |
| Phase 7-8 | Level 2 | SRE fundamentals | SLIs, SLOs, error budgets, alert design, incident response |
| Phase 9 | Level 2 | Cloud-native observability | Managed monitoring services, infrastructure health, distributed tracing |
| Phase 10 | Level 3 | Production readiness | Full observability stack, operational dashboards, postmortem culture |
## Level 1 Starter Guide: Basic Observability

### Getting Started Today (45 minutes)
Add observability to your first networked program:

1. **Structured logging setup:**

```python
import json
import logging
from datetime import datetime, timezone

# Configure logging to emit the raw message (our JSON payload)
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)

def log_structured(level, message, **kwargs):
    """Log structured data for machine processing."""
    log_entry = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'level': level,
        'message': message,
        **kwargs
    }
    # Emit at the requested level so log-level filtering still works
    logger.log(getattr(logging, level, logging.INFO), json.dumps(log_entry))

# Usage in your network service
def handle_request(request):
    log_structured('INFO', 'Request received',
                   method=request.method,
                   path=request.path,
                   client_ip=request.remote_addr)
    try:
        response = process_request(request)
        log_structured('INFO', 'Request completed',
                       status_code=response.status_code,
                       response_time_ms=response.processing_time)
        return response
    except Exception as e:
        log_structured('ERROR', 'Request failed',
                       error_type=type(e).__name__,
                       error_message=str(e))
        raise
```
2. **Basic metrics collection:**

```python
import functools
import time
from collections import Counter, defaultdict

class SimpleMetrics:
    def __init__(self):
        self.counters = Counter()
        self.gauges = {}
        self.histograms = defaultdict(list)

    def _key(self, metric_name, labels):
        return f"{metric_name}_{'_'.join(f'{k}:{v}' for k, v in labels.items())}"

    def increment(self, metric_name, value=1, **labels):
        self.counters[self._key(metric_name, labels)] += value

    def gauge(self, metric_name, value, **labels):
        self.gauges[self._key(metric_name, labels)] = value

    def histogram(self, metric_name, value, **labels):
        self.histograms[self._key(metric_name, labels)].append(value)

    def report(self):
        print("=== Metrics Report ===")
        print("Counters:", dict(self.counters))
        print("Gauges:", self.gauges)
        for key, values in self.histograms.items():
            avg = sum(values) / len(values) if values else 0
            print(f"Histogram {key}: avg={avg:.2f}, count={len(values)}")

# Usage in your service
metrics = SimpleMetrics()

def timed_operation(operation_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = func(*args, **kwargs)
                metrics.increment(f"{operation_name}_total", status='success')
                return result
            except Exception:
                metrics.increment(f"{operation_name}_total", status='error')
                raise
            finally:
                duration = time.time() - start_time
                metrics.histogram(f"{operation_name}_duration_seconds", duration)
        return wrapper
    return decorator

@timed_operation("database_query")
def query_user(user_id):
    # Your database query implementation
    pass
```
3. **Health check endpoint:**

```python
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Basic health check endpoint for monitoring."""
    health_status = {
        'status': 'healthy',
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'version': '1.0.0',
        'dependencies': {}
    }
    # Check critical dependencies
    try:
        # Test database connection
        db_status = check_database_connection()
        health_status['dependencies']['database'] = 'healthy' if db_status else 'unhealthy'
        if not db_status:
            health_status['status'] = 'degraded'
    except Exception:
        health_status['dependencies']['database'] = 'unhealthy'
        health_status['status'] = 'degraded'
    return jsonify(health_status)

@app.route('/metrics')
def metrics_endpoint():
    """Expose metrics for monitoring system scraping."""
    return jsonify({
        'counters': dict(metrics.counters),
        'gauges': metrics.gauges,
        'histograms': {k: len(v) for k, v in metrics.histograms.items()}
    })
```
## Core Observability Concepts (Learn These First)

**The Three Pillars of Observability:**

1. **Logs:** Detailed records of system events and behaviors
   - When to use: Debugging specific issues, audit trails, detailed troubleshooting
   - Best practices: Structured logging, appropriate log levels, correlation IDs

2. **Metrics:** Numerical measurements of system performance and behavior
   - When to use: Monitoring trends, alerting on thresholds, capacity planning
   - Types: Counters (events), gauges (current state), histograms (distributions)

3. **Traces:** Records of requests flowing through distributed systems
   - When to use: Understanding request flow, identifying bottlenecks in complex systems
   - Best practices: Distributed tracing, span relationships, timing analysis
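The correlation-ID best practice mentioned above can be sketched with Python's `contextvars`. This is an illustrative helper (the function names are my own, not from any library): each request gets a fresh ID, and every log line built while handling that request carries it, so the lines can later be grouped.

```python
import contextvars
import json
import uuid

# Correlation ID for the request currently being handled
_correlation_id = contextvars.ContextVar("correlation_id", default="-")

def start_request():
    """Assign a fresh correlation ID at the start of each request."""
    cid = uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def log_line(message, **fields):
    """Build a structured log line carrying the current correlation ID."""
    return json.dumps({
        "correlation_id": _correlation_id.get(),
        "message": message,
        **fields,
    })
```

Grepping a single correlation ID across log files then reconstructs one request's full history, even when many requests interleave.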
**Reliability Concepts:**

1. **Service Level Indicators (SLIs):** Measurements of service performance
   - Examples: Response time, error rate, throughput, availability

2. **Service Level Objectives (SLOs):** Targets for SLI performance
   - Examples: "99% of requests complete within 100ms", "99.9% uptime per month"

3. **Error Budgets:** Acceptable level of failures based on SLO targets
   - Example: 99.9% uptime allows 43.2 minutes of downtime per month
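The 43.2-minute figure follows directly from the SLO arithmetic. A small helper (a sketch for illustration, not part of any standard library) makes the relationship explicit:

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime an availability SLO permits over the window."""
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo_target) * minutes_in_window

# A 99.9% SLO over 30 days permits (1 - 0.999) * 43200 = 43.2 minutes
```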
## Practical Monitoring Techniques by System Type

### Single-Process Applications (Phase 5)

**Application monitoring:**

```python
import os
import time

import psutil

def get_system_metrics():
    """Collect basic system performance metrics."""
    process = psutil.Process(os.getpid())
    return {
        'cpu_percent': process.cpu_percent(),
        'memory_mb': process.memory_info().rss / 1024 / 1024,
        'open_files': process.num_fds() if hasattr(process, 'num_fds') else 0,
        'threads': process.num_threads(),
        'uptime_seconds': time.time() - process.create_time()
    }

def monitor_application():
    """Simple monitoring loop for development."""
    while True:
        metrics = get_system_metrics()
        log_structured('INFO', 'System metrics', **metrics)
        # Simple alerting
        if metrics['memory_mb'] > 512:
            log_structured('WARN', 'High memory usage',
                           memory_mb=metrics['memory_mb'])
        time.sleep(60)  # Check every minute
```
### Network Services (Phase 6-7)

**HTTP service monitoring:**

```python
import functools
import time

from flask import request
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Prometheus-compatible metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
ACTIVE_CONNECTIONS = Gauge('http_active_connections', 'Active HTTP connections')

def monitor_request(func):
    """Decorator to monitor HTTP request metrics."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        ACTIVE_CONNECTIONS.inc()
        try:
            response = func(*args, **kwargs)
            REQUEST_COUNT.labels(method=request.method,
                                 endpoint=request.endpoint,
                                 status=response.status_code).inc()
            return response
        except Exception:
            REQUEST_COUNT.labels(method=request.method,
                                 endpoint=request.endpoint,
                                 status=500).inc()
            raise
        finally:
            REQUEST_DURATION.observe(time.time() - start_time)
            ACTIVE_CONNECTIONS.dec()
    return wrapper

@app.route('/metrics')
def metrics():
    """Expose metrics in Prometheus format."""
    return generate_latest()
```
### Distributed Systems (Phase 8-9)

**Distributed tracing setup:**

```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure distributed tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

def traced_database_operation(query):
    """Database operation with distributed tracing."""
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("query.type", "SELECT")
        span.set_attribute("query.table", "users")
        try:
            result = execute_query(query)
            span.set_attribute("query.rows_returned", len(result))
            span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except Exception as e:
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```
## Building Your Reliability Mindset

### Questions for Every System Component

**Before building:**
- What could fail? (Failure mode analysis)
- How will we know if it's failing? (Monitoring strategy)
- How quickly do we need to detect failures? (Alerting requirements)
- What's the acceptable failure rate? (SLO definition)
- How do we recover when things go wrong? (Incident response planning)

**During operation:**
- Is the system behaving as expected? (Continuous monitoring)
- Are we within our error budget? (SLO tracking)
- What patterns indicate emerging problems? (Trend analysis)
- How do we improve system reliability? (Continuous improvement)
### Incident Response Basics (Phase 8+)

**Simple incident response framework:**

1. **Detection:** How do you know something is wrong?
   - Automated alerting based on SLI thresholds
   - User reports and external monitoring
   - Routine health checks and proactive monitoring

2. **Response:** What do you do when something breaks?
   - Immediate mitigation to restore service
   - Root cause investigation and diagnosis
   - Communication to affected users and stakeholders

3. **Recovery:** How do you prevent it from happening again?
   - Postmortem analysis without blame
   - System improvements and preventive measures
   - Documentation and knowledge sharing

4. **Learning:** How does this improve your systems?
   - Update monitoring and alerting based on incident learnings
   - Improve system design to prevent similar failures
   - Share knowledge across team and organization
### Postmortem Template (Start Using in Phase 6)

```markdown
# Incident Postmortem: [Brief Description]

## Summary
- **Date:** [When incident occurred]
- **Duration:** [How long service was impacted]
- **Impact:** [What users/systems were affected]
- **Root cause:** [What actually caused the problem]

## Timeline
- **[Time]:** [What happened]
- **[Time]:** [Response actions taken]
- **[Time]:** [Service restored]

## What Went Well
- [Positive aspects of response]

## What Could Be Improved
- [Areas for better response]

## Action Items
- [ ] [Specific improvement with owner and deadline]
- [ ] [Process change with implementation plan]

## Lessons Learned
- [Technical insights for future design]
- [Operational insights for future response]
```
## Essential Observability Tools by Phase

### Phase 5: Basic Monitoring
- **Python logging module:** Structured application logging
- **Grafana + InfluxDB:** Time-series metrics visualization
- **Simple alerting:** Email or Slack notifications for critical thresholds

### Phase 6-8: Service Monitoring
- **Prometheus:** Industry-standard metrics collection and alerting
- **ELK Stack (Elasticsearch, Logstash, Kibana):** Log aggregation and analysis
- **Jaeger or Zipkin:** Distributed tracing for service interactions
- **PagerDuty or Opsgenie:** Professional incident management and escalation

### Phase 9-10: Cloud-Native Observability
- **Cloud provider monitoring:** AWS CloudWatch, GCP Monitoring, Azure Monitor
- **Kubernetes monitoring:** Prometheus + Grafana for container orchestration
- **APM tools:** New Relic, Datadog, or AppDynamics for comprehensive application performance monitoring
- **Infrastructure monitoring:** Terraform state monitoring, cloud resource health tracking
## Reliability Engineering Fundamentals

### Defining Service Level Objectives

**Choose meaningful SLIs for your services:**

1. **Request/Response Services:**
   - Availability: Percentage of successful responses (200-299 status codes)
   - Latency: 95th percentile response time under specified load
   - Throughput: Requests per second the service can handle sustainably

2. **Data Processing Services:**
   - Completeness: Percentage of input records processed successfully
   - Freshness: Time between data arrival and processing completion
   - Accuracy: Percentage of outputs that match expected results

3. **Infrastructure Services:**
   - Uptime: Percentage of time service is available and responsive
   - Resource utilization: CPU, memory, disk usage within acceptable ranges
   - Error rates: Percentage of operations that complete without errors
**Example SLO definition:**

```yaml
service: user-authentication-api
slos:
  availability:
    sli: "Percentage of HTTP requests returning 200-299 status codes"
    target: "99.9% over rolling 30-day window"
    error_budget: "0.1% (43.2 minutes per month)"
  latency:
    sli: "95th percentile response time for authentication requests"
    target: "< 200ms under normal load"
    measurement: "Exclude cold starts and scheduled maintenance"
  throughput:
    sli: "Sustained requests per second without degradation"
    target: "> 1000 RPS with < 5% error rate increase"
```
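Given raw request counts and latency samples, the availability and p95-latency SLIs in this definition can be computed directly. This sketch uses the nearest-rank method for percentiles; the function names are illustrative, not from any monitoring library:

```python
import math

def availability(success_count, total_count):
    """Availability SLI: fraction of requests answered successfully."""
    return success_count / total_count if total_count else 1.0

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for the latency SLI."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]
```

In production you would pull these numbers from histogram buckets (e.g. Prometheus `histogram_quantile`) rather than raw samples, but the arithmetic is the same.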
### Error Budget Management

**Error budget calculation:**
- SLO target: 99.9% availability
- Error budget: 0.1% = 43.2 minutes downtime per month
- Burn rate: How quickly you're consuming the error budget

**Decision framework:**
- Budget remaining > 50%: Focus on feature development; acceptable to take risks
- Budget remaining 10-50%: Balance reliability improvements with feature work
- Budget remaining < 10%: Prioritize reliability, defer risky feature releases
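Burn rate can be made concrete in a couple of lines (a sketch, not from the source): a value of 1.0 means the budget lasts exactly the SLO window, and higher values consume it proportionally faster.

```python
def burn_rate(failed, total, slo_target):
    """Ratio of the observed error rate to the rate the SLO allows."""
    observed = failed / total if total else 0.0
    allowed = 1.0 - slo_target
    return observed / allowed

# At burn rate 2.0, a 30-day error budget is exhausted in 15 days
```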
## Practical Monitoring Implementation

### Phase 5: Network Service Monitoring

**Basic Flask service with monitoring:**

```python
import logging
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)
logger = logging.getLogger(__name__)
metrics = SimpleMetrics()  # From previous example

@app.route('/api/users/<user_id>')
@monitor_request  # Decorator from previous example
def get_user(user_id):
    """Get user information with observability."""
    span_start = time.time()
    try:
        # Validate input
        if not user_id.isdigit():
            metrics.increment('user_requests_total', status='invalid_input')
            return jsonify({'error': 'Invalid user ID'}), 400

        # Process request with timing
        with tracer.start_as_current_span("database_lookup"):
            user_data = database.get_user(int(user_id))

        if not user_data:
            metrics.increment('user_requests_total', status='not_found')
            return jsonify({'error': 'User not found'}), 404

        # Success metrics
        processing_time = time.time() - span_start
        metrics.histogram('request_duration_seconds', processing_time)
        metrics.increment('user_requests_total', status='success')
        return jsonify(user_data)
    except Exception as e:
        # Error handling with observability
        processing_time = time.time() - span_start
        metrics.histogram('request_duration_seconds', processing_time)
        metrics.increment('user_requests_total', status='error')
        logger.error("Unexpected error in get_user: %s", e,
                     extra={'user_id': user_id})
        return jsonify({'error': 'Internal server error'}), 500

if __name__ == '__main__':
    # Report metrics periodically from a background thread
    def report_metrics():
        while True:
            time.sleep(60)
            metrics.report()

    metrics_thread = threading.Thread(target=report_metrics, daemon=True)
    metrics_thread.start()
    app.run(host='0.0.0.0', port=5000)
```
### Phase 6-8: Distributed System Observability

**Service mesh observability (conceptual):**
- **Traffic management:** Monitor request routing, load balancing, circuit breaking
- **Security monitoring:** Track authentication, authorization, and encryption status
- **Performance monitoring:** End-to-end request tracing across multiple services
- **Dependency analysis:** Understanding service relationships and failure propagation
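Of the traffic-management behaviors listed above, circuit breaking is easy to sketch in application code. This minimal breaker (illustrative only, not a production implementation) opens after a run of consecutive failures and permits a trial request again once a reset timeout elapses:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then allows a trial request after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None while the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the timeout has passed
        return time.time() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()
```

In a mesh this logic lives in the sidecar proxy (e.g. Envoy outlier detection) rather than application code, but the state machine is the same.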
**Database monitoring integration:**

```python
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

# Monitor database operations via SQLAlchemy engine events
@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    context._query_start_time = time.time()
    metrics.increment('database_queries_total',
                      operation=statement.split()[0].upper())

@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total_time = time.time() - context._query_start_time
    metrics.histogram('database_query_duration_seconds', total_time)
```
### Phase 9-10: Production Monitoring Stack

**Cloud-native monitoring architecture:**
- **Application metrics:** Custom business metrics plus infrastructure metrics
- **Infrastructure monitoring:** Cloud resource utilization, network performance, storage health
- **Security monitoring:** Access patterns, authentication failures, potential security threats
- **Cost monitoring:** Resource usage costs and optimization opportunities
## Building Your SRE Mindset

### Operational Questions for Every Design Decision
- How will we know if this is working correctly in production?
- What metrics indicate this component is healthy vs. unhealthy?
- How do we detect when this component starts to degrade, before it fails completely?
- What's our plan for restoring service when this component fails?
- How do we prevent this type of failure from happening again?

### Balancing Reliability and Innovation

**Reliability vs. feature development trade-offs:**
- **High error budget remaining:** Take calculated risks on new features
- **Low error budget remaining:** Focus on reliability improvements and technical debt
- **Error budget exceeded:** Stop feature development; focus entirely on reliability

**Cultural practices:**
- **Blameless postmortems:** Focus on system improvement rather than individual fault
- **Error budget reviews:** Regular assessment of the reliability vs. innovation balance
- **Reliability targets:** Explicit SLOs that guide engineering priorities
## Resources for Continued Learning

### Books by Phase
- **Phase 5:** *The Art of Monitoring* - Fundamental monitoring concepts and practices
- **Phase 6-7:** *Monitoring and Observability* - Comprehensive guide to observability engineering
- **Phase 8:** *Site Reliability Engineering* - Google's approach to production reliability
- **Phase 9-10:** *The Site Reliability Workbook* - Practical SRE implementation guidance

### Online Resources
- **Prometheus documentation:** Industry-standard metrics and alerting
- **Grafana tutorials:** Visualization and dashboard creation
- **SRE resources:** Google SRE books, blog posts, and case studies
- **Cloud provider monitoring guides:** Platform-specific observability tools

### Practical Exercises
- **Monitor a personal project:** Add comprehensive observability to something you've built
- **Simulate failures:** Practice incident response with controlled system failures
- **Build dashboards:** Create operational dashboards for systems you understand well
- **Write runbooks:** Document operational procedures for services you maintain

### Community and Learning
- **SRE community forums:** Learn from production engineering experiences
- **Conference talks:** SRECon, Monitorama, and cloud provider conferences
- **Open source monitoring:** Contribute to observability tools and share learnings
- **Professional development:** Pursue SRE or DevOps roles that apply these skills professionally
**Remember:** Observability is not just tooling; it's a design philosophy that makes systems understandable and manageable in production. Every system you build should answer the question "How do we know if this is working correctly?" from the beginning of the design process.