{"aiPlatform":"claude-code@2025.06","category":"deployment","commandName":"/slo-implement","content":"---\nname: SLO Implementation Guide\ndescription: Expert tool for implementing Service Level Objectives with reliability standards, error budget engineering, and comprehensive monitoring systems balancing reliability and feature velocity.\nallowed_tools:\n  - filesystem      # Access system configurations and monitoring setup\n  - memory          # Track SLO patterns and reliability metrics\n  - sqlite          # Store SLO data and performance indicators\ntags:\n  - slo\n  - reliability\n  - error-budget\n  - monitoring\n  - service-level-indicators\ncategory: operations\nversion: 1.0.0\nauthor: AI Commands Team\n---\n\n# SLO Implementation Guide\n\nYou are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based engineering practices. Design comprehensive SLO frameworks, establish meaningful SLIs, and create monitoring systems that balance reliability with feature velocity.\n\n## Context\nThe user needs to implement SLOs to establish reliability targets, measure service performance, and make data-driven decisions about reliability vs. feature development. Focus on practical SLO implementation that aligns with business objectives.\n\n## Requirements\n$ARGUMENTS\n\n## Instructions\n\n### 1. 
SLO Foundation\n\nEstablish SLO fundamentals and framework:\n\n**SLO Framework Designer**\n```python\nimport numpy as np\nfrom datetime import datetime, timedelta\nfrom typing import Dict, List, Optional\n\nclass SLOFramework:\n    def __init__(self, service_name: str):\n        self.service = service_name\n        self.slos = []\n        self.error_budget = None\n        \n    def design_slo_framework(self):\n        \"\"\"\n        Design comprehensive SLO framework\n        \"\"\"\n        framework = {\n            'service_context': self._analyze_service_context(),\n            'user_journeys': self._identify_user_journeys(),\n            'sli_candidates': self._identify_sli_candidates(),\n            'slo_targets': self._calculate_slo_targets(),\n            'error_budgets': self._define_error_budgets(),\n            'measurement_strategy': self._design_measurement_strategy()\n        }\n        \n        return self._generate_slo_specification(framework)\n    \n    def _analyze_service_context(self):\n        \"\"\"Analyze service characteristics for SLO design\"\"\"\n        return {\n            'service_tier': self._determine_service_tier(),\n            'user_expectations': self._assess_user_expectations(),\n            'business_impact': self._evaluate_business_impact(),\n            'technical_constraints': self._identify_constraints(),\n            'dependencies': self._map_dependencies()\n        }\n    \n    def _determine_service_tier(self):\n        \"\"\"Determine appropriate service tier and SLO targets\"\"\"\n        tiers = {\n            'critical': {\n                'description': 'Revenue-critical or safety-critical services',\n                'availability_target': 99.95,\n                'latency_p99': 100,\n                'error_rate': 0.001,\n                'examples': ['payment processing', 'authentication']\n            },\n            'essential': {\n                'description': 'Core business functionality',\n                
'availability_target': 99.9,\n                'latency_p99': 500,\n                'error_rate': 0.01,\n                'examples': ['search', 'product catalog']\n            },\n            'standard': {\n                'description': 'Standard features',\n                'availability_target': 99.5,\n                'latency_p99': 1000,\n                'error_rate': 0.05,\n                'examples': ['recommendations', 'analytics']\n            },\n            'best_effort': {\n                'description': 'Non-critical features',\n                'availability_target': 99.0,\n                'latency_p99': 2000,\n                'error_rate': 0.1,\n                'examples': ['batch processing', 'reporting']\n            }\n        }\n        \n        # Analyze service characteristics to determine tier\n        characteristics = self._analyze_service_characteristics()\n        recommended_tier = self._match_tier(characteristics, tiers)\n        \n        return {\n            'recommended': recommended_tier,\n            'rationale': self._explain_tier_selection(characteristics),\n            'all_tiers': tiers\n        }\n    \n    def _identify_user_journeys(self):\n        \"\"\"Map critical user journeys for SLI selection\"\"\"\n        journeys = []\n        \n        # Example user journey mapping\n        journey_template = {\n            'name': 'User Login',\n            'description': 'User authenticates and accesses dashboard',\n            'steps': [\n                {\n                    'step': 'Load login page',\n                    'sli_type': 'availability',\n                    'threshold': '< 2s load time'\n                },\n                {\n                    'step': 'Submit credentials',\n                    'sli_type': 'latency',\n                    'threshold': '< 500ms response'\n                },\n                {\n                    'step': 'Validate authentication',\n                    'sli_type': 'error_rate',\n      
              'threshold': '< 0.1% auth failures'\n                },\n                {\n                    'step': 'Load dashboard',\n                    'sli_type': 'latency',\n                    'threshold': '< 3s full render'\n                }\n            ],\n            'critical_path': True,\n            'business_impact': 'high'\n        }\n        \n        # Register the example journey so callers actually receive it\n        journeys.append(journey_template)\n        \n        return journeys\n```\n\n### 2. SLI Selection and Measurement\n\nChoose and implement appropriate SLIs:\n\n**SLI Implementation**\n```python\nclass SLIImplementation:\n    def __init__(self):\n        self.sli_types = {\n            'availability': AvailabilitySLI,\n            'latency': LatencySLI,\n            'error_rate': ErrorRateSLI,\n            'throughput': ThroughputSLI,\n            'quality': QualitySLI\n        }\n    \n    def implement_slis(self, service_type):\n        \"\"\"Implement SLIs based on service type\"\"\"\n        if service_type == 'api':\n            return self._api_slis()\n        elif service_type == 'web':\n            return self._web_slis()\n        elif service_type == 'batch':\n            return self._batch_slis()\n        elif service_type == 'streaming':\n            return self._streaming_slis()\n        else:\n            # Fail loudly instead of silently returning None\n            raise ValueError(f\"Unsupported service type: {service_type}\")\n    \n    def _api_slis(self):\n        \"\"\"SLIs for API services\"\"\"\n        return {\n            'availability': {\n                'definition': 'Percentage of successful requests',\n                'formula': 'successful_requests / total_requests * 100',\n                'implementation': '''\n# Prometheus query for API availability\napi_availability = \"\"\"\nsum(rate(http_requests_total{status!~\"5..\"}[5m])) / \nsum(rate(http_requests_total[5m])) * 100\n\"\"\"\n\n# Implementation\nclass APIAvailabilitySLI:\n    def __init__(self, prometheus_client):\n        self.prom = prometheus_client\n        \n    def calculate(self, time_range='5m'):\n        query = f\"\"\"\n        sum(rate(http_requests_total{{status!~\"5..\"}}[{time_range}])) / \n        
sum(rate(http_requests_total[{time_range}])) * 100\n        \"\"\"\n        result = self.prom.query(query)\n        return float(result[0]['value'][1])\n    \n    def calculate_with_exclusions(self, time_range='5m'):\n        \"\"\"Calculate availability excluding certain endpoints\"\"\"\n        query = f\"\"\"\n        sum(rate(http_requests_total{{\n            status!~\"5..\",\n            endpoint!~\"/health|/metrics\"\n        }}[{time_range}])) / \n        sum(rate(http_requests_total{{\n            endpoint!~\"/health|/metrics\"\n        }}[{time_range}])) * 100\n        \"\"\"\n        return self.prom.query(query)\n'''\n            },\n            'latency': {\n                'definition': 'Percentage of requests faster than threshold',\n                'formula': 'fast_requests / total_requests * 100',\n                'implementation': '''\n# Latency SLI with multiple thresholds\nclass LatencySLI:\n    def __init__(self, thresholds_ms):\n        self.thresholds = thresholds_ms  # e.g., {'p50': 100, 'p95': 500, 'p99': 1000}\n    \n    def calculate_latency_sli(self, time_range='5m'):\n        slis = {}\n        \n        for percentile, threshold in self.thresholds.items():\n            query = f\"\"\"\n            sum(rate(http_request_duration_seconds_bucket{{\n                le=\"{threshold/1000}\"\n            }}[{time_range}])) / \n            sum(rate(http_request_duration_seconds_count[{time_range}])) * 100\n            \"\"\"\n            \n            slis[f'latency_{percentile}'] = {\n                'value': self.execute_query(query),\n                'threshold': threshold,\n                'unit': 'ms'\n            }\n        \n        return slis\n    \n    def calculate_user_centric_latency(self):\n        \"\"\"Calculate latency from user perspective\"\"\"\n        # Include client-side metrics\n        query = \"\"\"\n        histogram_quantile(0.95,\n            sum(rate(user_request_duration_bucket[5m])) by (le)\n        )\n        
\"\"\"\n        return self.execute_query(query)\n'''\n            },\n            'error_rate': {\n                'definition': 'Percentage of successful requests',\n                'formula': '(1 - error_requests / total_requests) * 100',\n                'implementation': '''\nclass ErrorRateSLI:\n    def calculate_error_rate(self, time_range='5m'):\n        \"\"\"Calculate error rate with categorization\"\"\"\n        \n        # Different error categories\n        error_categories = {\n            'client_errors': 'status=~\"4..\"',\n            'server_errors': 'status=~\"5..\"',\n            'timeout_errors': 'status=\"504\"',\n            'business_errors': 'error_type=\"business_logic\"'\n        }\n        \n        results = {}\n        for category, filter_expr in error_categories.items():\n            query = f\"\"\"\n            sum(rate(http_requests_total{{{filter_expr}}}[{time_range}])) / \n            sum(rate(http_requests_total[{time_range}])) * 100\n            \"\"\"\n            results[category] = self.execute_query(query)\n        \n        # Overall error rate (excluding 4xx)\n        overall_query = f\"\"\"\n        (1 - sum(rate(http_requests_total{{status=~\"5..\"}}[{time_range}])) / \n        sum(rate(http_requests_total[{time_range}]))) * 100\n        \"\"\"\n        results['overall_success_rate'] = self.execute_query(overall_query)\n        \n        return results\n'''\n            }\n        }\n```\n\n### 3. 
Error Budget Calculation\n\nImplement error budget tracking:\n\n**Error Budget Manager**\n```python\nclass ErrorBudgetManager:\n    def __init__(self, slo_target: float, window_days: int):\n        self.slo_target = slo_target\n        self.window_days = window_days\n        self.error_budget_minutes = self._calculate_total_budget()\n    \n    def _calculate_total_budget(self):\n        \"\"\"Calculate total error budget in minutes\"\"\"\n        total_minutes = self.window_days * 24 * 60\n        allowed_downtime_ratio = 1 - (self.slo_target / 100)\n        return total_minutes * allowed_downtime_ratio\n    \n    def calculate_error_budget_status(self, start_date, end_date):\n        \"\"\"Calculate current error budget status\"\"\"\n        # Get actual performance (assumed to be uptime in minutes)\n        actual_uptime = self._get_actual_uptime(start_date, end_date)\n        \n        # Calculate consumed budget: every minute of downtime spends budget\n        total_time = (end_date - start_date).total_seconds() / 60\n        consumed_minutes = total_time - actual_uptime\n        \n        # Calculate remaining budget\n        remaining_budget = self.error_budget_minutes - consumed_minutes\n        \n        # Burn rate: share of budget spent relative to share of window elapsed\n        # (1.0 means the budget runs out exactly when the window ends)\n        window_minutes = self.window_days * 24 * 60\n        consumed_ratio = consumed_minutes / self.error_budget_minutes\n        burn_rate = consumed_ratio / (total_time / window_minutes) if total_time > 0 else 0.0\n        \n        # Project exhaustion at the current burn rate\n        if burn_rate > 0:\n            remaining_ratio = max(remaining_budget, 0) / self.error_budget_minutes\n            days_until_exhaustion = (self.window_days * remaining_ratio) / burn_rate\n        else:\n            days_until_exhaustion = float('inf')\n        \n        return {\n            'total_budget_minutes': self.error_budget_minutes,\n            'consumed_minutes': consumed_minutes,\n            'remaining_minutes': remaining_budget,\n            'burn_rate': burn_rate,\n            'budget_percentage_remaining': (remaining_budget / self.error_budget_minutes) * 100,\n            'projected_exhaustion_days': days_until_exhaustion,\n            'status': self._determine_status(remaining_budget, burn_rate)\n        }\n    \n    def 
_determine_status(self, remaining_budget, burn_rate):\n        \"\"\"Determine error budget status\"\"\"\n        if remaining_budget <= 0:\n            return 'exhausted'\n        elif burn_rate > 2:\n            return 'critical'\n        elif burn_rate > 1.5:\n            return 'warning'\n        elif burn_rate > 1:\n            return 'attention'\n        else:\n            return 'healthy'\n    \n    def generate_burn_rate_alerts(self):\n        \"\"\"Generate multi-window burn rate alerts\"\"\"\n        # Budget shares assume a 30-day (720 h) window: burn_rate * hours / 720\n        return {\n            'fast_burn': {\n                'description': '14.4x burn rate over 1 hour',\n                'condition': 'burn_rate >= 14.4 AND window = 1h',\n                'action': 'page',\n                'budget_consumed': '2% in 1 hour'\n            },\n            'slow_burn': {\n                'description': '3x burn rate over 6 hours',\n                'condition': 'burn_rate >= 3 AND window = 6h',\n                'action': 'ticket',\n                'budget_consumed': '2.5% in 6 hours'\n            }\n        }\n```\n\n### 4. 
SLO Monitoring Setup\n\nImplement comprehensive SLO monitoring:\n\n**SLO Monitoring Implementation**\n```yaml\n# Prometheus recording rules for SLO\ngroups:\n  - name: slo_rules\n    interval: 30s\n    rules:\n      # Request rate\n      - record: service:request_rate\n        expr: |\n          sum(rate(http_requests_total[5m])) by (service, method, route)\n      \n      # Success rate\n      - record: service:success_rate_5m\n        expr: |\n          (\n            sum(rate(http_requests_total{status!~\"5..\"}[5m])) by (service)\n            /\n            sum(rate(http_requests_total[5m])) by (service)\n          ) * 100\n      \n      # Multi-window success rates\n      - record: service:success_rate_30m\n        expr: |\n          (\n            sum(rate(http_requests_total{status!~\"5..\"}[30m])) by (service)\n            /\n            sum(rate(http_requests_total[30m])) by (service)\n          ) * 100\n      \n      - record: service:success_rate_1h\n        expr: |\n          (\n            sum(rate(http_requests_total{status!~\"5..\"}[1h])) by (service)\n            /\n            sum(rate(http_requests_total[1h])) by (service)\n          ) * 100\n      \n      # Latency percentiles\n      - record: service:latency_p50_5m\n        expr: |\n          histogram_quantile(0.50,\n            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)\n          )\n      \n      - record: service:latency_p95_5m\n        expr: |\n          histogram_quantile(0.95,\n            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)\n          )\n      \n      - record: service:latency_p99_5m\n        expr: |\n          histogram_quantile(0.99,\n            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)\n          )\n      \n      # Error budget burn rate\n      - record: service:error_budget_burn_rate_1h\n        expr: |\n          (\n            1 - (\n              
sum(increase(http_requests_total{status!~\"5..\"}[1h])) by (service)\n              /\n              sum(increase(http_requests_total[1h])) by (service)\n            )\n          ) / (1 - 0.999) # 99.9% SLO\n```\n\n**Alert Configuration**\n```yaml\n# Multi-window multi-burn-rate alerts\n# Assumes service:error_budget_burn_rate_5m, _30m and _6h are recorded\n# the same way as the 1h rule, and that the SLO window is 30 days\ngroups:\n  - name: slo_alerts\n    rules:\n      # Fast burn alert (2% budget in 1 hour)\n      - alert: ErrorBudgetFastBurn\n        expr: |\n          (\n            service:error_budget_burn_rate_5m{service=\"api\"} > 14.4\n            AND\n            service:error_budget_burn_rate_1h{service=\"api\"} > 14.4\n          )\n        for: 2m\n        labels:\n          severity: critical\n          team: platform\n        annotations:\n          summary: \"Fast error budget burn for {{ $labels.service }}\"\n          description: |\n            Service {{ $labels.service }} is burning error budget at 14.4x rate.\n            Current burn rate: {{ $value }}x\n            This will exhaust 2% of monthly budget in 1 hour.\n          \n      # Slow burn alert (2.5% budget in 6 hours)\n      - alert: ErrorBudgetSlowBurn\n        expr: |\n          (\n            service:error_budget_burn_rate_30m{service=\"api\"} > 3\n            AND\n            service:error_budget_burn_rate_6h{service=\"api\"} > 3\n          )\n        for: 15m\n        labels:\n          severity: warning\n          team: platform\n        annotations:\n          summary: \"Slow error budget burn for {{ $labels.service }}\"\n          description: |\n            Service {{ $labels.service }} is burning error budget at 3x rate.\n            Current burn rate: {{ $value }}x\n            This will exhaust 2.5% of monthly budget in 6 hours.\n```\n\n### 5. 
SLO Dashboard\n\nCreate comprehensive SLO dashboards:\n\n**Grafana Dashboard Configuration**\n```python\ndef create_slo_dashboard():\n    \"\"\"Generate Grafana dashboard for SLO monitoring\"\"\"\n    return {\n        \"dashboard\": {\n            \"title\": \"Service SLO Dashboard\",\n            \"panels\": [\n                {\n                    \"title\": \"SLO Summary\",\n                    \"type\": \"stat\",\n                    \"gridPos\": {\"h\": 4, \"w\": 6, \"x\": 0, \"y\": 0},\n                    \"targets\": [{\n                        \"expr\": \"service:success_rate_30d{service=\\\"$service\\\"}\",\n                        \"legendFormat\": \"30-day SLO\"\n                    }],\n                    \"fieldConfig\": {\n                        \"defaults\": {\n                            \"thresholds\": {\n                                \"mode\": \"absolute\",\n                                \"steps\": [\n                                    {\"color\": \"red\", \"value\": None},\n                                    {\"color\": \"yellow\", \"value\": 99.5},\n                                    {\"color\": \"green\", \"value\": 99.9}\n                                ]\n                            },\n                            \"unit\": \"percent\"\n                        }\n                    }\n                },\n                {\n                    \"title\": \"Error Budget Status\",\n                    \"type\": \"gauge\",\n                    \"gridPos\": {\"h\": 4, \"w\": 6, \"x\": 6, \"y\": 0},\n                    \"targets\": [{\n                        \"expr\": '''\n                        100 * (\n                            1 - (\n                                (1 - service:success_rate_30d{service=\"$service\"}/100) /\n                                (1 - $slo_target/100)\n                            )\n                        )\n                        ''',\n                        \"legendFormat\": \"Remaining Budget\"\n 
                   }],\n                    \"fieldConfig\": {\n                        \"defaults\": {\n                            \"min\": 0,\n                            \"max\": 100,\n                            \"thresholds\": {\n                                \"mode\": \"absolute\",\n                                \"steps\": [\n                                    {\"color\": \"red\", \"value\": None},\n                                    {\"color\": \"yellow\", \"value\": 20},\n                                    {\"color\": \"green\", \"value\": 50}\n                                ]\n                            },\n                            \"unit\": \"percent\"\n                        }\n                    }\n                },\n                {\n                    \"title\": \"Burn Rate Trend\",\n                    \"type\": \"graph\",\n                    \"gridPos\": {\"h\": 8, \"w\": 12, \"x\": 12, \"y\": 0},\n                    \"targets\": [\n                        {\n                            \"expr\": \"service:error_budget_burn_rate_1h{service=\\\"$service\\\"}\",\n                            \"legendFormat\": \"1h burn rate\"\n                        },\n                        {\n                            \"expr\": \"service:error_budget_burn_rate_6h{service=\\\"$service\\\"}\",\n                            \"legendFormat\": \"6h burn rate\"\n                        },\n                        {\n                            \"expr\": \"service:error_budget_burn_rate_24h{service=\\\"$service\\\"}\",\n                            \"legendFormat\": \"24h burn rate\"\n                        }\n                    ],\n                    \"yaxes\": [{\n                        \"format\": \"short\",\n                        \"label\": \"Burn Rate (x)\",\n                        \"min\": 0\n                    }],\n                    \"alert\": {\n                        \"conditions\": [{\n                            \"evaluator\": 
{\"params\": [14.4], \"type\": \"gt\"},\n                            \"operator\": {\"type\": \"and\"},\n                            \"query\": {\"params\": [\"A\", \"5m\", \"now\"]},\n                            \"type\": \"query\"\n                        }],\n                        \"name\": \"High burn rate detected\"\n                    }\n                }\n            ]\n        }\n    }\n```\n\n### 6. SLO Reporting\n\nGenerate SLO reports and reviews:\n\n**SLO Report Generator**\n```python\nclass SLOReporter:\n    def __init__(self, metrics_client):\n        self.metrics = metrics_client\n        \n    def generate_monthly_report(self, service, month):\n        \"\"\"Generate comprehensive monthly SLO report\"\"\"\n        report_data = {\n            'service': service,\n            'period': month,\n            'slo_performance': self._calculate_slo_performance(service, month),\n            'incidents': self._analyze_incidents(service, month),\n            'error_budget': self._analyze_error_budget(service, month),\n            'trends': self._analyze_trends(service, month),\n            'recommendations': self._generate_recommendations(service, month)\n        }\n        \n        return self._format_report(report_data)\n    \n    def _calculate_slo_performance(self, service, month):\n        \"\"\"Calculate SLO performance metrics\"\"\"\n        slos = {}\n        \n        # Availability SLO\n        availability_query = f\"\"\"\n        avg_over_time(\n            service:success_rate_5m{{service=\"{service}\"}}[{month}]\n        )\n        \"\"\"\n        slos['availability'] = {\n            'target': 99.9,\n            'actual': self.metrics.query(availability_query),\n            'met': self.metrics.query(availability_query) >= 99.9\n        }\n        \n        # Latency SLO\n        latency_query = f\"\"\"\n        quantile_over_time(0.95,\n            service:latency_p95_5m{{service=\"{service}\"}}[{month}]\n        )\n        \"\"\"\n        
slos['latency_p95'] = {\n            'target': 500,  # ms\n            'actual': self.metrics.query(latency_query) * 1000,\n            'met': self.metrics.query(latency_query) * 1000 <= 500\n        }\n        \n        return slos\n    \n    def _format_report(self, data):\n        \"\"\"Format report as HTML\"\"\"\n        return f\"\"\"\n<!DOCTYPE html>\n<html>\n<head>\n    <title>SLO Report - {data['service']} - {data['period']}</title>\n    <style>\n        body {{ font-family: Arial, sans-serif; margin: 40px; }}\n        .summary {{ background: #f0f0f0; padding: 20px; border-radius: 8px; }}\n        .metric {{ margin: 20px 0; }}\n        .good {{ color: green; }}\n        .bad {{ color: red; }}\n        table {{ border-collapse: collapse; width: 100%; }}\n        th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}\n        .chart {{ margin: 20px 0; }}\n    </style>\n</head>\n<body>\n    <h1>SLO Report: {data['service']}</h1>\n    <h2>Period: {data['period']}</h2>\n    \n    <div class=\"summary\">\n        <h3>Executive Summary</h3>\n        <p>Service reliability: {data['slo_performance']['availability']['actual']:.2f}%</p>\n        <p>Error budget remaining: {data['error_budget']['remaining_percentage']:.1f}%</p>\n        <p>Number of incidents: {len(data['incidents'])}</p>\n    </div>\n    \n    <div class=\"metric\">\n        <h3>SLO Performance</h3>\n        <table>\n            <tr>\n                <th>SLO</th>\n                <th>Target</th>\n                <th>Actual</th>\n                <th>Status</th>\n            </tr>\n            {self._format_slo_table_rows(data['slo_performance'])}\n        </table>\n    </div>\n    \n    <div class=\"incidents\">\n        <h3>Incident Analysis</h3>\n        {self._format_incident_analysis(data['incidents'])}\n    </div>\n    \n    <div class=\"recommendations\">\n        <h3>Recommendations</h3>\n        {self._format_recommendations(data['recommendations'])}\n    
</div>\n</body>\n</html>\n\"\"\"\n```\n\n### 7. SLO-Based Decision Making\n\nImplement SLO-driven engineering decisions:\n\n**SLO Decision Framework**\n```python\nclass SLODecisionFramework:\n    def __init__(self, error_budget_policy):\n        self.policy = error_budget_policy\n        \n    def make_release_decision(self, service, release_risk):\n        \"\"\"Make release decisions based on error budget\"\"\"\n        budget_status = self.get_error_budget_status(service)\n        \n        decision_matrix = {\n            'healthy': {\n                'low_risk': 'approve',\n                'medium_risk': 'approve',\n                'high_risk': 'review'\n            },\n            'attention': {\n                'low_risk': 'approve',\n                'medium_risk': 'review',\n                'high_risk': 'defer'\n            },\n            'warning': {\n                'low_risk': 'review',\n                'medium_risk': 'defer',\n                'high_risk': 'block'\n            },\n            'critical': {\n                'low_risk': 'defer',\n                'medium_risk': 'block',\n                'high_risk': 'block'\n            },\n            'exhausted': {\n                'low_risk': 'block',\n                'medium_risk': 'block',\n                'high_risk': 'block'\n            }\n        }\n        \n        decision = decision_matrix[budget_status['status']][release_risk]\n        \n        return {\n            'decision': decision,\n            'rationale': self._explain_decision(budget_status, release_risk),\n            'conditions': self._get_approval_conditions(decision, budget_status),\n            'alternative_actions': self._suggest_alternatives(decision, budget_status)\n        }\n    \n    def prioritize_reliability_work(self, service):\n        \"\"\"Prioritize reliability improvements based on SLO gaps\"\"\"\n        slo_gaps = self.analyze_slo_gaps(service)\n        \n        priorities = []\n        for gap in slo_gaps:\n  
          priority_score = self.calculate_priority_score(gap)\n            \n            priorities.append({\n                'issue': gap['issue'],\n                'impact': gap['impact'],\n                'effort': gap['estimated_effort'],\n                'priority_score': priority_score,\n                'recommended_actions': self.recommend_actions(gap)\n            })\n        \n        return sorted(priorities, key=lambda x: x['priority_score'], reverse=True)\n    \n    def calculate_toil_budget(self, team_size, slo_performance):\n        \"\"\"Calculate how much toil is acceptable based on SLOs\"\"\"\n        # If meeting SLOs, can afford more toil\n        # If not meeting SLOs, need to reduce toil\n        \n        base_toil_percentage = 50  # Baseline; Google SRE guidance treats 50% as a ceiling on toil\n        \n        if slo_performance >= 100:\n            # Exceeding SLO, can take on more toil (note: above the 50% guideline)\n            toil_budget = base_toil_percentage + 10\n        elif slo_performance >= 99:\n            # Meeting SLO\n            toil_budget = base_toil_percentage\n        else:\n            # Not meeting SLO, reduce toil\n            toil_budget = base_toil_percentage - (100 - slo_performance) * 5\n        \n        # Apply the 20% floor before deriving hours so all three values agree\n        toil_budget = max(toil_budget, 20)\n        \n        return {\n            'toil_percentage': toil_budget,  # Minimum 20%\n            'toil_hours_per_week': (toil_budget / 100) * 40 * team_size,\n            'automation_hours_per_week': ((100 - toil_budget) / 100) * 40 * team_size\n        }\n```\n\n### 8. 
SLO Templates\n\nProvide SLO templates for common services:\n\n**SLO Template Library**\n```python\nclass SLOTemplates:\n    @staticmethod\n    def get_api_service_template():\n        \"\"\"SLO template for API services\"\"\"\n        return {\n            'name': 'API Service SLO Template',\n            'slos': [\n                {\n                    'name': 'availability',\n                    'description': 'The proportion of successful requests',\n                    'sli': {\n                        'type': 'ratio',\n                        'good_events': 'requests with status != 5xx',\n                        'total_events': 'all requests'\n                    },\n                    'objectives': [\n                        {'window': '30d', 'target': 99.9}\n                    ]\n                },\n                {\n                    'name': 'latency',\n                    'description': 'The proportion of fast requests',\n                    'sli': {\n                        'type': 'ratio',\n                        'good_events': 'requests faster than 500ms',\n                        'total_events': 'all requests'\n                    },\n                    'objectives': [\n                        {'window': '30d', 'target': 95.0}\n                    ]\n                }\n            ]\n        }\n    \n    @staticmethod\n    def get_data_pipeline_template():\n        \"\"\"SLO template for data pipelines\"\"\"\n        return {\n            'name': 'Data Pipeline SLO Template',\n            'slos': [\n                {\n                    'name': 'freshness',\n                    'description': 'Data is processed within SLA',\n                    'sli': {\n                        'type': 'ratio',\n                        'good_events': 'batches processed within 30 minutes',\n                        'total_events': 'all batches'\n                    },\n                    'objectives': [\n                        {'window': '7d', 'target': 
99.0}\n                    ]\n                },\n                {\n                    'name': 'completeness',\n                    'description': 'All expected data is processed',\n                    'sli': {\n                        'type': 'ratio',\n                        'good_events': 'records successfully processed',\n                        'total_events': 'all records'\n                    },\n                    'objectives': [\n                        {'window': '7d', 'target': 99.95}\n                    ]\n                }\n            ]\n        }\n```\n\n### 9. SLO Automation\n\nAutomate SLO management:\n\n**SLO Automation Tools**\n```python\nclass SLOAutomation:\n    def __init__(self):\n        self.config = self.load_slo_config()\n        \n    def auto_generate_slos(self, service_discovery):\n        \"\"\"Automatically generate SLOs for discovered services\"\"\"\n        services = service_discovery.get_all_services()\n        generated_slos = []\n        \n        for service in services:\n            # Analyze service characteristics\n            characteristics = self.analyze_service(service)\n            \n            # Select appropriate template\n            template = self.select_template(characteristics)\n            \n            # Customize based on observed behavior\n            customized_slo = self.customize_slo(template, service)\n            \n            generated_slos.append(customized_slo)\n        \n        return generated_slos\n    \n    def implement_progressive_slos(self, service):\n        \"\"\"Implement progressively stricter SLOs\"\"\"\n        return {\n            'phase1': {\n                'duration': '1 month',\n                'target': 99.0,\n                'description': 'Baseline establishment'\n            },\n            'phase2': {\n                'duration': '2 months',\n                'target': 99.5,\n                'description': 'Initial improvement'\n            },\n            'phase3': {\n   
             'duration': '3 months',\n                'target': 99.9,\n                'description': 'Production readiness'\n            },\n            'phase4': {\n                'duration': 'ongoing',\n                'target': 99.95,\n                'description': 'Excellence'\n            }\n        }\n    \n    def create_slo_as_code(self):\n        \"\"\"Define SLOs as code\"\"\"\n        return '''\n# slo_definitions.yaml\napiVersion: slo.dev/v1\nkind: ServiceLevelObjective\nmetadata:\n  name: api-availability\n  namespace: production\nspec:\n  service: api-service\n  description: API service availability SLO\n  \n  indicator:\n    type: ratio\n    counter:\n      metric: http_requests_total\n      filters:\n        - status_code != 5xx\n    total:\n      metric: http_requests_total\n  \n  objectives:\n    - displayName: 30-day rolling window\n      window: 30d\n      target: 0.999\n      \n  alerting:\n    burnRates:\n      - severity: critical\n        shortWindow: 5m\n        longWindow: 1h\n        burnRate: 14.4\n      - severity: warning\n        shortWindow: 30m\n        longWindow: 6h\n        burnRate: 3\n        \n  annotations:\n    runbook: https://runbooks.example.com/api-availability\n    dashboard: https://grafana.example.com/d/api-slo\n'''\n```\n\n### 10. 
SLO Culture and Governance\n\nEstablish SLO culture:\n\n**SLO Governance Framework**\n```python\nclass SLOGovernance:\n    def establish_slo_culture(self):\n        \"\"\"Establish SLO-driven culture\"\"\"\n        return {\n            'principles': [\n                'SLOs are a shared responsibility',\n                'Error budgets drive prioritization',\n                'Reliability is a feature',\n                'Measure what matters to users'\n            ],\n            'practices': {\n                'weekly_reviews': self.weekly_slo_review_template(),\n                'incident_retrospectives': self.slo_incident_template(),\n                'quarterly_planning': self.quarterly_slo_planning(),\n                'stakeholder_communication': self.stakeholder_report_template()\n            },\n            'roles': {\n                'slo_owner': {\n                    'responsibilities': [\n                        'Define and maintain SLO definitions',\n                        'Monitor SLO performance',\n                        'Lead SLO reviews',\n                        'Communicate with stakeholders'\n                    ]\n                },\n                'engineering_team': {\n                    'responsibilities': [\n                        'Implement SLI measurements',\n                        'Respond to SLO breaches',\n                        'Improve reliability',\n                        'Participate in reviews'\n                    ]\n                },\n                'product_owner': {\n                    'responsibilities': [\n                        'Balance features vs reliability',\n                        'Approve error budget usage',\n                        'Set business priorities',\n                        'Communicate with customers'\n                    ]\n                }\n            }\n        }\n    \n    def create_slo_review_process(self):\n        \"\"\"Create structured SLO review process\"\"\"\n        return '''\n# 
Weekly SLO Review Template\n\n## Agenda (30 minutes)\n\n### 1. SLO Performance Review (10 min)\n- Current SLO status for all services\n- Error budget consumption rate\n- Trend analysis\n\n### 2. Incident Review (10 min)\n- Incidents impacting SLOs\n- Root cause analysis\n- Action items\n\n### 3. Decision Making (10 min)\n- Release approvals/deferrals\n- Resource allocation\n- Priority adjustments\n\n## Review Checklist\n\n- [ ] All SLOs reviewed\n- [ ] Burn rates analyzed\n- [ ] Incidents discussed\n- [ ] Action items assigned\n- [ ] Decisions documented\n\n## Output Template\n\n### Service: [Service Name]\n- **SLO Status**: [Green/Yellow/Red]\n- **Error Budget**: [XX%] remaining\n- **Key Issues**: [List]\n- **Actions**: [List with owners]\n- **Decisions**: [List]\n'''\n```\n\n## Output Format\n\n1. **SLO Framework**: Comprehensive SLO design and objectives\n2. **SLI Implementation**: Code and queries for measuring SLIs\n3. **Error Budget Tracking**: Calculations and burn rate monitoring\n4. **Monitoring Setup**: Prometheus rules and Grafana dashboards\n5. **Alert Configuration**: Multi-window multi-burn-rate alerts\n6. **Reporting Templates**: Monthly reports and reviews\n7. **Decision Framework**: SLO-based engineering decisions\n8. **Automation Tools**: SLO-as-code and auto-generation\n9. 
**Governance Process**: Culture and review processes\n\nFocus on creating meaningful SLOs that balance reliability with feature velocity, providing clear signals for engineering decisions and fostering a culture of reliability.","contentHash":"546e6e7554ae7e88cabe6b27838d40ca5a3f88c483453f3a62e8461f40e278bf","copies":0,"createdAt":"2025-08-12T16:09:38.969Z","description":"Implement Service Level Objectives (SLOs)","github":{"repoUrl":"https://github.com/Commands-com/commands","lastSyncDirection":"from-github","metadata":{"importedFrom":"github_repository","repoPrivate":false,"repoDefaultBranch":"main","connectedAt":"2025-08-12T16:09:38.969Z"},"importedAt":"2025-08-12T16:09:38.969Z","lastSyncAt":"2025-08-17T17:57:48.747Z","fileMapping":{"license":null,"readme":null,"assets":[],"mainFile":"tools/slo-implement.md"},"selectedCommand":"slo-implement","fileShas":{"mainFile":"589870f6ddfd7e5b9f22e0360556a10fd743602d","yamlPath":"313647b1fb381389da33b7913e95baf617c4b392"},"branch":"main","connectionType":"commands_yaml","connected":true,"lastSyncCommit":"01591bc061d236bde47bf23b0f47e8afcf1a5144","importSource":"repository_import","installationId":"69232615","syncStatus":"synced"},"githubRepoUrl":"https://github.com/Commands-com/commands","id":"e8f4522e-5b90-4593-a21e-85fb7646318e","inputParameters":[{"defaultValue":"99.9","name":"slo_target","options":["99","99.5","99.9","99.95","99.99"],"description":"Target availability percentage","label":"SLO Target","type":"select","required":false},{"defaultValue":"alert","name":"error_budget_policy","options":["alert","throttle-deploys","freeze-features","custom"],"description":"How to handle error budget","label":"Error Budget Policy","type":"select","required":false}],"instructions":"Implement Service Level Objectives (SLOs)","likes":0,"mcp_search_content":"","organizationUsername":"commands-com","price":"free","search_content":"slo implement implement service level objectives (slos) /slo-implement deployment 
claude-code@2025.06","title":"SLO Implement","type":"command","updatedAt":"2025-08-17T17:57:48.747Z","userId":"W0V8NAw5AhWRwcuwSoFLOi1Yem83","visibility":"public","name":"slo-implement","userInteraction":{"userHasStarred":false}}