High availability means designing a system to keep working even when some components fail.
In cloud architecture, this usually means avoiding single points of failure across zones, load balancers, compute, databases, storage, and deployment processes.
The core principle is: assume failure will happen, then design the system so users do not feel every failure.
Cloud Architecture Brief
Many people assume that running in the cloud automatically provides high availability, but a poorly designed cloud system can still fail just like a single VPS.
Businesses need availability because downtime can stop revenue, break user trust, and create operational emergencies.
High availability is the design practice of removing single points of failure and keeping service available during partial failure.
If you can identify single points of failure, you can design HA for web apps, databases, queues, Kubernetes, DNS, CDN, and AI services.
Architecture Decision
Three-tier web application on public cloud.
Users reach DNS and the CDN; traffic goes to a regional load balancer; compute runs in multiple zones; the database uses a replica or managed HA; backups are tested; and monitoring detects failure.
Design for failure at every layer instead of trusting one component.
Cloud providers offer building blocks, but the architect must combine them correctly.
Common mistakes: running everything on one VM, using only one zone, forgetting health checks, taking backups but never testing a restore, and ignoring database failover.
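To make the single-VM and single-zone problems concrete, the sketch below flags any layer that one instance failure or one zone failure would take down. The `Layer` model, its field names, and the example design are hypothetical illustrations, not a cloud provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer of the request path (hypothetical model for illustration)."""
    name: str
    instances: int  # independent instances serving this layer
    zones: int      # distinct zones those instances span


def single_points_of_failure(layers: list[Layer]) -> list[str]:
    """Return one finding per layer that a single instance or zone failure would take down."""
    findings = []
    for layer in layers:
        if layer.instances < 2:
            findings.append(f"{layer.name}: only one instance")
        elif layer.zones < 2:
            findings.append(f"{layer.name}: all instances in one zone")
    return findings


# Example design (assumed values): the app tier is single-zone, the database is a single instance.
design = [
    Layer("load balancer", instances=2, zones=2),
    Layer("app", instances=2, zones=1),
    Layer("database", instances=1, zones=1),
]

for finding in single_points_of_failure(design):
    print(finding)
```

A check like this only covers instance and zone counts; untested backups and missing health checks still need their own review.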
More availability costs more money and complexity; not every workload needs the same uptime target.
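The cost of an uptime target is easier to judge as a yearly downtime budget. The sketch below converts availability to hours of downtime per year and shows how availability combines across required layers and across redundant copies. The function names are illustrative, and the redundancy formula assumes copies fail independently, which real zones only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction such as 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


def serial(*availabilities: float) -> float:
    """Every component is required, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(availability: float, copies: int) -> float:
    """Any one copy is enough; assumes the copies fail independently."""
    return 1.0 - (1.0 - availability) ** copies


for target in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(target)
    print(f"{target:.2%} uptime allows {hours:.2f} h of downtime per year")

# Two required layers at 99.9% each are less available together than either alone.
print(f"serial: {serial(0.999, 0.999):.6f}")

# Two independent copies at 99% each are more available than one copy.
print(f"redundant: {redundant(0.99, 2):.6f}")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the uptime target should follow the business value of the workload.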
Cloud Building Blocks
Run multiple app instances across zones or use managed compute with zone redundancy.
Use load balancers, health checks, multiple subnets, multiple zones, DNS failover where appropriate, and controlled routing.
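Load balancer health checks usually mark a target unhealthy only after several consecutive failed probes, and healthy again only after several consecutive passes. The sketch below models that behavior in memory. The `Target` class, the function names, and the threshold values are assumptions for illustration; real load balancers expose these settings under their own names.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """A backend instance behind the load balancer (hypothetical model)."""
    name: str
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


def record_check(target: Target, passed: bool,
                 unhealthy_threshold: int = 3, healthy_threshold: int = 2) -> None:
    """Update a target's state from one probe result, using consecutive-result thresholds."""
    if passed:
        target.consecutive_successes += 1
        target.consecutive_failures = 0
        if not target.healthy and target.consecutive_successes >= healthy_threshold:
            target.healthy = True
    else:
        target.consecutive_failures += 1
        target.consecutive_successes = 0
        if target.healthy and target.consecutive_failures >= unhealthy_threshold:
            target.healthy = False


def routable(targets: list[Target]) -> list[str]:
    """Names of targets that should currently receive traffic."""
    return [t.name for t in targets if t.healthy]


targets = [Target("app-zone-a"), Target("app-zone-b")]

# The instance in zone b fails three probes in a row and is taken out of rotation.
for _ in range(3):
    record_check(targets[1], passed=False)

print(routable(targets))
```

Thresholds trade detection speed against flapping: lower values remove a bad target sooner but react to brief blips.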
Use replicated storage or durable managed storage; do not rely on one disk as the only copy.
Use a managed HA database, replicas, automated backups, tested restores, and clear RTO and RPO targets.
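RPO is the maximum tolerable data loss, measured as time, and RTO is the maximum tolerable time to restore service. The sketch below checks a backup schedule and a measured restore time against those targets. It assumes periodic backups are the only protection (no replication), and all names and example values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups only, a failure just before the next backup loses one full interval."""
    return backup_interval


def meets_targets(backup_interval: timedelta, measured_restore_time: timedelta,
                  rpo: timedelta, rto: timedelta) -> dict[str, bool]:
    """Compare the backup schedule and a tested restore time against the RPO and RTO targets."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Assumed example: nightly backups, a restore drill that took 3 hours,
# and targets of 1 hour of data loss and 4 hours to recover.
result = meets_targets(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

The restore time has to come from an actual restore test; an untested backup gives no evidence that the RTO can be met.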
Protect failover systems with the same IAM, encryption, secret handling, and network isolation as primary systems.
Monitor uptime, error rate, latency, saturation, database replication, backup success, and failover events.
Enterprise Readiness
Use multi-zone compute, load balancer health checks, database HA, backup testing, and documented failover steps.
Scale horizontally, use stateless application design, cache safe reads, and use queues to absorb traffic spikes.
Keep failover resources private, restrict database access, and audit emergency permissions.
Match HA level to business value; do not build expensive multi-region systems for low-value internal tools.
Confirm the alert, identify the failed layer, remove the unhealthy target, check database status, verify the traffic path, and communicate incident status.
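Identifying the failed layer can be approached by probing the traffic path in order and stopping at the first layer that does not respond. The sketch below is a minimal version of that idea. The layer names, the `first_failed_layer` helper, and the stub probes are hypothetical; real probes would perform a DNS lookup, an HTTP request, or a database connection test.

```python
from typing import Callable, Optional

# Layers in the order a user request passes through them (assumed from the design above).
TRAFFIC_PATH = ["dns", "cdn", "load_balancer", "app", "database"]


def first_failed_layer(probes: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run probes in traffic order and return the first failing layer, or None if all pass."""
    for layer in TRAFFIC_PATH:
        probe = probes.get(layer)
        if probe is None:
            continue  # no probe registered for this layer; skip it
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises is treated as a failed layer
        if not ok:
            return layer
    return None


# Stub probes standing in for real checks; here the app tier is failing.
probes = {
    "dns": lambda: True,
    "cdn": lambda: True,
    "load_balancer": lambda: True,
    "app": lambda: False,
    "database": lambda: True,
}

print(first_failed_layer(probes))
```

Probing in traffic order keeps the responder from investigating the database when the request never got past the load balancer.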
Failure & Job Readiness
Common failure modes: single-zone app, unhealthy load balancer target, database primary failure, bad deployment, DNS misconfiguration, and expired certificate.
Check zones; check target health; check database failover; check backup restore; check certificate; check rollback plan.
Fail over database, remove bad instance, rollback release, restore backup, shift DNS or load balancer traffic, scale capacity.
An ecommerce site must remain online during one zone outage because each hour of downtime loses revenue.
What is the difference between backup, disaster recovery, and high availability?
Create a high-availability design for a small ecommerce site with two app instances, managed database HA, object storage, CDN, monitoring, and rollback.
Disaster Recovery; Load Balancer; RTO; RPO; Backup; Monitoring