Reliability Engineering in the Cloud

by Mariya Breyter; Carlos Rojas

Book Details

Book Title: Reliability Engineering in the Cloud
Author: Mariya Breyter; Carlos Rojas
Publisher: Addison-Wesley
Publication Date: 2025
ISBN: 9780135395790
Number of Pages: 471
Language: English
Format: PDF
File Size: 6 MB
Subject: Cloud Computing

Table of Contents

  • Cover Page
  • About This eBook
  • Title Page
  • Copyright Page
  • Dedication Page
  • Contents
  • Preface
  • Acknowledgments
  • About the Authors
  • Chapter 1. Reliability Engineering in the Cloud
      • Cloud
      • Resilience
      • Reliability
      • Engineering
      • Engineering Excellence
      • How to Design and Build Resilient and Reliable Applications
      • Leveraging Lean Principles
      • Leveraging Artificial Intelligence
      • Leveraging Value Stream Mapping
      • Culture and Values
      • Operational Excellence
      • Summary
      • Q&A
  • Chapter 2. Resilient, Available, and Scalable Systems
      • Key Concepts
      • Design Principles
      • Chaos Engineering
      • Validating Resilience
      • Summary
      • Q&A
  • Chapter 3. Incident Response for Fast Recovery
      • Incident Response
      • Fast Recovery
      • Incident Handling
      • Summary
      • Q&A
  • Chapter 4. Operational Excellence and Change Management
      • Key Performance Indicators
      • Root Cause Analysis
      • Incident Reviews
      • Change Management
      • Case Study
      • Architecture and Reliability Assessments
      • Summary
      • Q&A
  • Chapter 5. Leveraging Observability, Monitoring, Reliability Metrics, and GenAI
      • Reliability Engineering Capabilities
      • Ten-Step Process for Creating Effective Monitoring
      • Maturity Levels
      • Monitoring and Alerting Tools
      • Case Study: AI’s Impact on CRE
      • Summary
      • Q&A
  • Chapter 6. CRE via Objectives and Key Results (OKRs)
      • Continuous Improvement in Lean
      • Application of Lean to CRE
      • Application of OKRs to CRE
      • Summary
      • Q&A
  • Chapter 7. CRE Tooling
      • Distributing Load and Volume with Auto-Scaling and Load Balancing
      • Enabling Automatic Failovers for High Availability
      • Facilitating Controlled Deployments with Rollback Strategies
      • Providing Chaos Engineering Capabilities for Resilience Testing
      • Assisting in Incident Response with Automation
      • Ensuring Proper Configuration Management
      • Leveraging Immutable Infrastructure as a Service
      • Practicing Disaster Recovery Frequently
      • Case Study
      • Summary
      • Q&A
  • Chapter 8. Cutting-Edge Technologies
      • Understanding AI, ML, LLMs, and GenAI
      • Benefits of Integrating These Technologies into CRE Practices
      • Implementation Considerations
      • Summary
      • Q&A
  • Chapter 9. CRE Value Stream
      • What Is a Value Stream?
      • CRE as a Value Stream
      • Case Studies
      • Summary
      • Q&A
  • Chapter 10. Culture
      • Psychological Safety
      • Employee Empowerment
      • Leadership and Ownership
      • Collaboration and Cross-Functional Teams
      • Customer Obsession
      • CRE Culture
      • Summary
      • Q&A
  • Chapter 11. The Business Case for CRE
      • Benefits of CRE
      • Aligning CRE with Strategic Objectives
      • Evolution of CRE Practices
      • Case Studies
      • Summary
      • Q&A
  • Chapter 12. Conclusion
  • Appendix A. Incident Response Checklist Template
  • Appendix B. Correction of Error (COE) Document Structure
  • Appendix C. CRE Change Management Checklist
  • Glossary
  • References
  • Index