So, Amazon Web Services (AWS) is the first major cloud provider to fail, and much of the failure is attributed to human error during an upgrade. From an architectural perspective, what have those who rely on Amazon's services learned?
It will take time for cloud service providers to mature to an acceptable, robust level of industry operations. In the meantime, what have we learned, architecturally, that can make what is available today work better for us?
- Know What Is a Cloud Service and What Is Not. I would suggest that 75% of what is sold today as a “cloud” service is simply hosted infrastructure for some product. Period. I contend the tenets of cloud are (1) redundancy, (2) elasticity, and (3) a simple Pay-As-You-Go (PAYG) financial model. Those who tout themselves as cloud service providers but are not are setting themselves up for a greater fall than the AWS situation.
- State[less] Management Rules. Architecting solutions that are as stateless as possible means there is no single focal point on which your logic depends. State is the new single point of failure, especially when it is maintained in data repositories that are not redundant in real time.
Excuse the short tangent, but I think the future place to store state will ultimately be the client itself: devices such as phones, tablets, and so on.
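One way to push state to the client while keeping the server stateless is to hand the client a signed blob it must return on each request. A minimal sketch, assuming an HMAC-based scheme (the secret, function names, and payload shape here are all hypothetical, not from any particular framework):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-signing-key"  # hypothetical shared signing key

def issue_token(state: dict) -> str:
    """Serialize client-held state and sign it so no server-side session is needed."""
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return (payload + b"." + sig).decode()

def read_token(token: str) -> dict:
    """Verify the signature and recover the state; any server instance can do this."""
    payload, sig = token.encode().rsplit(b".", 1)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered client state")
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because every server holding the signing key can validate the token, any instance can serve any request, and losing an instance loses no state.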
- Elasticity Is Tricky. Real Tricky. My exposure to cloud solutions has shown that automated elasticity is a rarity. Most of what I have seen is done by manually anticipating spike activity. Netflix uses its “N+1” redundancy model to protect itself; others manually configure their fabric according to anticipated load. Continued manual experimentation will be necessary before automated means can be designed. Even once the problem is better understood, automated elasticity will take a generous number of experimental iterations, because initially automation simply lets us break things much faster.
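To make concrete how crude most scaling logic still is, here is a sketch of the kind of threshold rule a naive autoscaler applies (the thresholds and growth factors are illustrative assumptions, not anyone's production policy):

```python
def desired_capacity(current: int, cpu_load: float,
                     scale_up_at: float = 0.75, scale_down_at: float = 0.30,
                     minimum: int = 2) -> int:
    """Return the instance count a simple threshold-based autoscaler would target.

    Grows aggressively on high load, shrinks one instance at a time, and
    never drops below a floor that preserves N+1 headroom.
    """
    if cpu_load > scale_up_at:
        return current + max(1, current // 2)   # add ~50% capacity on a spike
    if cpu_load < scale_down_at:
        return max(minimum, current - 1)        # shrink cautiously
    return current                              # in-band: leave it alone
```

The hard part is not this arithmetic; it is choosing the thresholds, the warm-up lag of new instances, and the oscillation this rule invites, which is exactly where the manual experimentation comes in.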
- React to Poor Health Before Failure. Many are learning that to work in the cloud, you must: fail early, fail often, and deal with it. If we monitor health prior to failure, we can modify our service operations to mitigate failure before it happens. This requires trend-analysis algorithms that react in real time to trend changes as they begin, not after services fail.
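The simplest form of such trend analysis is to watch the slope of a health metric over a sliding window and alarm while the service is still up. A minimal sketch, assuming latency as the metric (class name, window size, and slope threshold are all illustrative):

```python
from collections import deque

class TrendMonitor:
    """Flag a degrading trend in a health metric before outright failure."""

    def __init__(self, window: int = 5, slope_limit: float = 5.0):
        self.samples = deque(maxlen=window)   # sliding window of recent samples
        self.slope_limit = slope_limit        # alarm when per-sample rise exceeds this

    def observe(self, latency_ms: float) -> bool:
        """Record a sample; return True when latency is trending up too fast."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough history yet
        # Average change per sample across the window.
        slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        return slope > self.slope_limit
```

A real deployment would smooth out noise (EWMA, percentiles) and tie the alarm to an action such as shedding load or shifting traffic, but the principle is the same: react to the slope, not the outage.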
- Real Cloud Storage Is in Its Infancy. Using models that mimic your antiquated relational practices in the cloud may be the cause of your demise. Many new cloud storage approaches that deviate from the norm are surfacing, and they warrant learning a new way rather than making the cloud work like your existing internal systems. If you get burned by forcing your traditional storage mechanisms onto a cloud service, you are the responsible party when it fails.
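One example of “learning a new way”: instead of normalizing data and joining at query time, key-value cloud stores favor denormalizing each record for its access pattern ahead of time. A hypothetical sketch (the field names and key scheme are invented for illustration):

```python
def denormalize_order(order: dict, customer: dict) -> dict:
    """Fold into the order the customer fields a relational design would JOIN in,
    producing one self-contained key-value entry per read pattern."""
    return {
        "key": f"order#{order['id']}",           # lookup key replaces the JOIN
        "value": {
            **order,
            "customer_name": customer["name"],   # duplicated on purpose
            "customer_email": customer["email"],
        },
    }
```

The duplication feels wrong to a relational mindset, but it is what lets a read be a single key lookup on a partitioned, redundant store with no cross-node join.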
- Cloud Solutions Need Operational Automation. The AWS customers who fared well in the outage and those who fared poorly have one thing in common: their success or failure is attributed to how they manually reacted from an operations perspective. This is key, as all have stated that future success will require more operational automation. Put simply, we cannot rely on the cloud service provider’s operational architecture to take care of our interests. Rather, we must create our own frameworks around theirs to promote greater operational excellence and survive their inadequacies.
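A small example of wrapping our own framework around the provider's: retry with backoff, then fail over to another region ourselves rather than waiting for the provider to route around its own outage. A sketch under stated assumptions; `call_region` stands in for whatever per-region client call you actually have:

```python
import time

def call_region(region: str, request: dict) -> dict:
    """Hypothetical per-region call, e.g. a thin wrapper over an SDK client."""
    raise NotImplementedError

def resilient_call(request: dict, regions: list,
                   attempts: int = 3, backoff_s: float = 0.5,
                   call=call_region) -> dict:
    """Retry each region with exponential backoff, then fail over to the next."""
    last_error = None
    for region in regions:
        for attempt in range(attempts):
            try:
                return call(region, request)
            except Exception as err:   # broad on purpose: any fault triggers failover
                last_error = err
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"all regions failed: {last_error}")
```

The point is not the twenty lines of code; it is that this decision loop lives in *your* operational framework, so your service keeps running even when one region of the provider does not.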
- Learn and Grow from This Experience. Now is not the time to bail on cloud computing; rather, it is the time to recognize that cloud computing is maturing. The lesson, however, is to make your cloud entry strategy one of moderation. Do not throw all your service eggs into one cloud basket, as there are more maturity lessons to be learned. Better to learn them with less mission-critical solutions, so that when the time comes, both the cloud service providers and your own expertise will have evolved enough to build a solid architecture around those mission-critical cloud candidates.