Monday, April 25, 2011

Cloud Computing | Amazon AWS EC2 Outage

Lessons learned from the recent AWS EC2 outage on April 21, 2011.

More complex systems are less stable. We need to design our solutions for failure, and that is exactly the response to a situation like this one. The majority of the folks who survived this outage, or felt only minimal impact, have a few commonalities to share:

1. Spread the solution across multiple availability zones (in practice, these are physically separate datacenters).

2. For mission-critical applications such as banking, health care, and government, spread your solution across multiple regions. AWS operates in 5 regions:

a. US East Region

b. US West Region

c. EU West Region

d. Singapore

e. Tokyo (which recently felt the impact of the tsunami)

And for mission-critical applications, having disaster recovery (DR) in place is a must. We all know it.

3. Spread across multiple providers: you may choose to deploy your solution with more than one cloud provider for more resilience.

4. It is imperative that you understand AWS thoroughly before deploying your solution. EBS, which appears to be what failed this time, offers redundancy comparable to RAID 1/5/6/10, but only within a single availability zone. So for mission-critical and high-availability solutions, do not rely on EBS redundancy alone; consider instance-store-backed instances instead of EBS-backed instances.
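To make points 1 and 2 a bit more concrete, here is a minimal sketch in Python of what "spreading" means in practice. It does not call any AWS API. The helper names `spread_instances` and `surviving_capacity` are made up for illustration, and the zone labels are only examples. It simply assigns instances to zones round-robin and shows how much capacity is left when one zone goes dark.

```python
from collections import Counter
from itertools import cycle


def spread_instances(instance_count, zones):
    """Assign instances to zones round-robin so no single zone holds them all."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement, failed_zone):
    """Return the fraction of instances still running if one zone fails."""
    if not placement:
        return 0.0
    alive = sum(1 for zone in placement if zone != failed_zone)
    return alive / len(placement)


# Example zone labels (assumed for illustration): two zones in one region
# plus one zone in a second region.
zones = ["us-east-1a", "us-east-1b", "us-west-1a"]
placement = spread_instances(6, zones)

print(Counter(placement))                           # instances per zone
print(surviving_capacity(placement, "us-east-1a"))  # capacity left after one zone fails
```

With everything in one zone, the same failure leaves nothing running; with three zones, roughly two thirds of the instances survive. The real work, of course, is making the application tolerate that loss.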

Outages of this sort are a learning curve for providers, system integrators (SIs), and consumers. Cloud computing is just a tool like any other, and how we leverage it is in our hands.

Will it happen again? I am sure it will!

There will be fewer outages, though, as we progress and become more proficient at managing our solutions on the public cloud.