Architecting in the Cloud -- Part 1

In mid-2012 I started preparing for a large event that was coming up for our company. It had been almost two years since we built our hosting infrastructure, and in a little over a year those leases would start expiring. We'd need to choose between keeping what we had and getting new hardware. (Spoiler: I picked the latter, but not without good reason.)

A little history first. When our current platform was designed and implemented, it wasn't virtualized at all. It was 100% bare-metal, and that was ingrained into some of the base decisions. Over the course of its lifetime we needed to start virtualizing our environment, and we now run over seven hundred virtual machines on top of an infrastructure that was never meant to be virtualized.

This technical shoehorning has led to a lot of interesting challenges, which we've worked hard to eliminate. Last month we set an internal record and hit 99.9998% uptime on that patchwork infrastructure. So I believe it's safe to say that we've overcome the core challenges, but that doesn't mean we can simply carry on as we have been. These are some of the factors:

  • Our hardware has been declared end of life by our vendor.
  • We've been hitting low-level limitations of our current hypervisor.
  • There's a major and very expensive growth point on the horizon.

Given these, I wanted to redesign our hosting from the ground up. I wanted an infrastructure designed to be highly redundant and virtualized from day one. I wanted a cloud, but not a public one.

Why build a private cloud?

This is honestly a good question. I actually think that the concept of the cloud is way oversold by many vendors. It's not a panacea; it's not going to fix all of your woes. Honestly, I'd be surprised if it didn't end up causing some new ones. Even if the platform is generic, like AWS, you still have to design around that platform's deficiencies and caveats. If it's less generic, like Google App Engine, or even certain facets of Windows Azure, you can be completely married to the cloud in question, and that could lead to an extremely painful and costly divorce.

I've also found that it is not universally true that you can save money by using the cloud. Compared to what I'm building, AWS would cost an order of magnitude more for the same capacity with on-demand instances, and 3-4 times more with reserved instances. Bandwidth through AWS would cost me roughly six times more as well.

There is a definite point of scale where salaries and other intangible costs are easily less than those differences. I believe we're well past that point with Curse's network of sites. I've got enough going around in my head about the cloud in general that I'll probably end up writing another post in the future that goes deeper into my thoughts on the subject.

I've gone on about all of that to make a point, though. As a company, our bandwidth and compute needs don't fit well, or economically, into any of the public cloud offerings. Could we do it? Yes, absolutely. Would it save us money? Nope, we'd actually spend a lot more.

Virtualization, however, brings a lot that's very useful for us. We run a lot of different sites: sites of all shapes, colors, uses, and demands. Using bare metal for all of these sites would not even be remotely cost effective. I'd need hundreds of physical servers to reach the same level of redundancy and stability, and huge portions of that capacity would sit there idle most, if not all, of the time. The general benefits of virtualization are fairly well established, so I'll stop there.

So public clouds are not suitable, but we need virtualization. The only reasonable solution I could find given those two things was to build a robust private cloud that we can grow on top of.

Design goals

In designing this magical private cloud I set out with a few major goals and considerations:

  • Achievable "five nines" uptime. Our official SLA is only 99.95%, but I have a personal goal of 99.999%, which I hope to be hitting routinely by the end of the year. (See the quick downtime math after this list.)
  • Avoiding hardware-maintenance-related downtime. Currently a firmware upgrade on our SAN results in 30+ minutes of downtime.
  • Performance. Our applications can become extremely bottlenecked on CPU and/or disk IO. Those bottlenecks must be mitigated as much as possible.
  • Planned growth. Unplanned organic growth is the death of an otherwise good system. Each major component of the infrastructure must have a planned, non-destructive growth point.
  • Reevaluate hypervisor selection. As mentioned, we've been hitting low-level limitations of our current hypervisor.
  • Backups and disaster scenario response. There's no such thing as foolproof.
  • Keep costs reasonable. This is a balancing act with the other goals, as performance and reliability often bring additional costs.
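
To put those uptime targets in perspective, here's a rough sketch (Python, purely illustrative) of how much downtime each availability level actually allows, assuming a simple 365-day year:

    # Downtime budget for a given availability target, assuming a 365-day year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for target in (99.95, 99.99, 99.999, 99.9998):
        allowed = MINUTES_PER_YEAR * (1 - target / 100)
        print(f"{target}% uptime allows {allowed:6.1f} min/year ({allowed / 12:5.2f} min/month)")

That 99.999% goal works out to roughly five minutes of downtime for an entire year, which is why maintenance-related downtime has to be designed out rather than merely minimized.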

I'm going to cover all of these goals in more depth in future articles. I'll be posting the next part soon, where I'll go over my thoughts on high availability and redundancy.

I invite anyone who's curious for more info or has questions to engage with me on Twitter. I love geeking out about this kind of thing, and it's a lot of fun to talk about.