Architecting in the Cloud -- Part 1

In mid-2012 I started preparing for a large event that was coming up for our company. It had been almost two years since we built our hosting infrastructure, and in a little over a year those leases would start expiring. We'd need to choose to keep what we had or get new hardware. (Spoiler: I picked the latter, but not without good reason.)

A little history first. When our current platform was designed and implemented it wasn't virtualized at all. It was 100% bare-metal, and that was ingrained into some of the base decisions. Over the course of its lifetime we needed to start virtualizing our environment, and now we run over seven hundred virtual machines on top of an infrastructure that was never meant to be virtualized.

This technical shoehorning has led to a lot of interesting challenges which we've worked hard to eliminate. Last month we set an internal record and hit 99.9998% uptime on that patchwork infrastructure. So I believe it's safe to say that we've overcome the core challenges, but this does not mean we could just carry on as we have been. These are some of the factors:

  • Our hardware has been declared end of life by our vendor.
  • We've been hitting low-level limitations of our current hypervisor.
  • There's a major and very expensive growth point on the horizon.

Given these, I wanted to redesign our hosting from the ground up. I wanted an infrastructure designed to be highly redundant and virtualized from day one. I wanted a cloud, but not a public one.

Why build a private Cloud?

This is honestly a good question. I actually think that the concept of the cloud is way oversold by many vendors. It's not a panacea; it's not going to fix all of your woes. Honestly, I'd be surprised if it didn't end up causing some new woes. Even if it's generic, like AWS, you still have to design around that platform's deficiencies and caveats. If it's less generic, like Google App Engine, or even certain facets of Windows Azure, you can be completely married to the cloud in question, and that could lead to an extremely painful and costly divorce.

I've also found that it is not universally true that you can save money by using the cloud. Compared to what I'm building, AWS costs an order of magnitude more for the same capacity with unreserved instances and 3-4 times more on reserved instances. Bandwidth through AWS would cost me roughly 6 times more as well.

There is a definite line on size where salaries and other intangible costs are easily less than those differences. I believe we're easily past that with Curse's network of sites. I've got enough going around in my head about the Cloud in general that I'll probably end up writing out another post in the future that goes deeper into my thoughts on the subject.

I've gone on about all that to get to a point though. As a company, our bandwidth and compute needs don't fit well, or economically, into any of the public cloud offerings. Could we do it? Yes, absolutely. Would it save us money? Nope, we'd actually spend a lot more.

Virtualization, however, brings a lot that's very useful for us. We run a lot of different sites. Sites of all shapes, and colors, and uses, and demands. Using bare-metal for all of these sites would not even be remotely cost effective. I'd need hundreds of physical servers to reach the same level of redundancy and stability, and huge portions of it would sit there idle most if not all of the time. I believe the general benefits of virtualization are fairly well established so I'll stop there.

So public clouds are not suitable, but we need virtualization. The only reasonable solution I could find given those two things was to build a robust private cloud that we can grow on top of.

Design goals

In designing this magical private cloud I set out with a few major goals and consideration points:

  • Achievable "five nines" uptime. Our official SLA is only 99.95%, however I have a personal 99.999% goal. I hope to routinely hit it by end of year.
  • Avoiding hardware maintenance-related downtime. Currently a firmware upgrade on our SAN results in 30+ minutes of downtime.
  • Performance. Our applications can become extremely bottlenecked on CPU and/or disk I/O. Those bottlenecks must be mitigated as well as possible.
  • Planned growth. Organic growth means the death of an otherwise good system. Each major component of the infrastructure must have a planned non-destructive growth point.
  • Reevaluate Hypervisor Selection. As mentioned we have been hitting fundamental bottlenecks in our current hypervisor.
  • Backups and disaster scenario response. There's no such thing as foolproof.
  • Keep costs reasonable. This is a balancing act with the other goals, as many times performance and reliability bring additional costs.
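To put the uptime goal in perspective, here is a quick back-of-the-envelope sketch of how much downtime each percentage actually allows. The function name and the 365-day period are assumptions for illustration only.

```python
def downtime_budget_seconds(availability_percent, period_days=365):
    """Return the seconds of downtime allowed over period_days
    at the given availability percentage."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100.0)


# Compare the official SLA with the personal "five nines" goal.
for target in (99.95, 99.999):
    minutes = downtime_budget_seconds(target) / 60.0
    print("%s%% allows about %.1f minutes of downtime per year" % (target, minutes))
```

Over a year, 99.95% allows roughly 4.4 hours of downtime, while 99.999% allows only about 5.3 minutes. A single 30+ minute SAN firmware upgrade would blow through the five-nines budget several times over, which is why maintenance-related downtime is on the list.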

All of these goals I'm going to cover in more depth in future articles. I'll be posting the next part of this soon where I'll go over my thoughts on High Availability and Redundancy.

I invite anyone who's curious for more info or has questions to please engage with me on Twitter. I love geeking out about this kinda thing and it's a lot of fun to talk about.

This isn't a DDoS, it's an Arms Race

I mentioned it in my last post, but we've been under almost constant threat of DDoS attacks for many months now. Up until a few years ago it wasn't a big deal. One or two guys would find a few machines to mess with us. The attacks were simplistic and came and went, and all in all weren't a huge problem. But as arms races go, things began to escalate. A new attack type would show up and we'd figure out how to stop it, and then we'd do our 'normal job' while we waited.

Now about a year or so ago things really started to heat up. We upgraded our hardware load balancers to a pair of decked-out Brocade ADX 1000s (side note: the ADX platform is amazing; if people would like to see me write about why, hit me up on Twitter). We didn't do this as part of the arms race, but it had the added benefit of being able to protect against certain types of attacks, including SYN floods and excessive request limits. Honestly, they handle these pretty well and we'd have been sunk without them.

Again this just caused further escalation over the next several months as we went back and forth ad nauseam, and then in the last six months or so we started seeing more sizable and sophisticated methods. One such attacker figured out it would take us approximately five to seven minutes to identify and block them, so they started rotating IPs every four minutes or so.
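The timing problem with that rotation trick is easy to see in a toy model. The sketch below is only an illustration under simplifying assumptions: every attacking IP rotates at the same moment, and the detection clock restarts with each rotation. The function name is made up for this example.

```python
def blocked_fraction(rotation_minutes, detection_minutes):
    """Fraction of each rotation window during which the attacker's
    current set of IPs is actually blocked (toy model)."""
    blocked = max(0.0, rotation_minutes - detection_minutes)
    return blocked / float(rotation_minutes)


# Four-minute rotation against a five-to-seven-minute response time.
for detection in (5, 7):
    print("detect in %d min -> blocked %.0f%% of the time"
          % (detection, 100 * blocked_fraction(4, detection)))
```

As long as the response takes longer than the rotation interval, the block always lands on addresses that are no longer in use, so manual blocking never catches up.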

My team and I have worked nights and weekends, sometimes several times a week, for months. For hours at a time we'd lose sleep and off time from these attacks and the cascades of failures that would follow (remember how I said we were reevaluating hypervisors?). We were all frazzled and couldn't get a chance to really recover. Largely we had come to the conclusion that there was very little left for us to do. We needed to outsource our side of the conflict to a specialist.

Last week I went to our datacenter and we installed a pair of devices. These are specialized in stopping DDoS attacks before they ever reach the backend servers. Last Wednesday we turned them on, and since then they've identified almost non-stop attacks that account for forty to sixty percent of all incoming packets. I expected to see ongoing attacks, but honestly this ratio is much higher than I would have thought.

So far the devices have been very effective. We've seen few if any false positives, and they truly were virtually plug-n-play, something no one on my team has ever seen with a device that can filter millions of packets a second in real time.

Most of what we see are standard things we had a handle on before, but we've also seen at least four sizable attacks in the last four days. These were stopped in minutes instead of hours of combined downtime and cleanup efforts, but I can already see the other side escalating just over these four attacks.

One of the most mind boggling things about DDoS attacks is the cost asymmetry. You can launch a successful attack by renting a few AWS servers for less than ten dollars an hour, or even renting a full on botnet for twenty or so. To use a mitigation device like we're working with now the price tag is in the six figures with tens of thousands in ongoing support costs.

I sincerely believe that every dollar is money well spent. Even ignoring the costs of an outage and lost revenue, or brand damage and shaken user loyalty, you can hardly put a price on the intangible costs associated with responding to these attacks, especially when you feel powerless to do much more than triage the damage.

Honestly though, between you and me, as great as I think these new devices are I have my reservations. No matter how much money, time, and resources go into the creation of these magical boxes and their mystical algorithms, I expect it's only a matter of time before the combined mental fortitude of millions of angsty teenagers finds a fatal flaw. At least I can count on our new partner to escalate in turn with specializations and expertise that I can barely fathom.

Monday was a revelation for me. For the first time in months I felt rested and didn't feel like I was defeated before the week started. It was honestly such a sweet feeling that it literally brought tears to my eyes. For the first time in a long while, as the war rages on in the background, I feel like I can sleep soundly.

We're building a Cloud… We're building it bigger

For going on seven years I've been with Curse and we've managed to do many wild and diverse things, and it's been a great ride so far. I've gotten to work on one of the largest Django websites on the internet (at the time), I worked on a desktop application that is in use by two million people worldwide (I later wrote the Mac version from scratch), and for the last few years my main effort has been production level IT work.

I want to make it clear: these aren't your corporate stooge, PC Load Letter style IT problems. We run one of the largest website networks in the country. Our monthly dynamic requests are numbered in the billions, we've broken twenty-five million uniques, and we are continually reminded by our vendors and partners that the challenges we face are in no way standard. Despite all that, we truly knew we'd arrived when we started having significant daily DDoS attacks (and there was much rejoicing...).

In those last seven years we've gone from fully managed dedicated servers, to colocated servers, to a private cage with some managed services, and now we're planning what the next iteration is going to look like. As much as I dislike buzzwords, it's going to look like a cloud.

We started virtualizing about two years ago in response to security and scalability concerns that only isolation could really fix. It's worked: we now run more than 800 VMs in production, adding more all the time. The problem we ran into was that the hardware we had in place wasn't really meant for virtualization. Case in point: our LAN switch at the time. It had 192 ports, and was fine pre-virtualization. What we didn't realize is that its MAC address table only had room for about 200 entries. The new virtualized environment grew and we ended up with more than 500 MAC addresses in fairly short order. We decided to replace the old switch when we saw it was dropping more than 35 million packets a day on certain ports.

I point all of that out to illustrate that we didn't really know what we were getting into. We were growing organically and reactively, and that's a problem. It hasn't really stopped yet. Every time I've turned around for the last few years that bare metal hardware stack has been pushed to the limits and forced to grow in ways it was never meant to.

So here we are going on three years later, with more than a few battle scars but a lot wiser for the wear. We started planning for the new hosting more than six months ago, and during that time I've had one overriding mantra: Planned flexible growth is a must. Never grow organically or reactively. My second mantra has been: Everything fails, don't let anyone notice. I honestly don't know that I'll fully be able to achieve those lofty goals. Eventually something we couldn't possibly imagine will rear its ugly head and force me to react. I'll damn sure try to make it three years before that day comes though.

So what am I building? Here are a few bullet points.

  • Three distinct availability zones. Each zone will have dedicated and isolated networking, compute and management servers, as well as its own SAN.
  • Capacity will be planned so that a whole zone can be taken offline for maintenance, or, in a more spectacular moment, can crash without noticeably impacting uptime or performance.
  • Avoid 1 Gbit networking wherever possible. Prefer 10 Gbit instead; it should last longer.
  • Avoid spinning disks. Use SSD storage for all critical applications.
  • Use the best hypervisor tech. That's a powder keg statement, but it's important. We're reevaluating our original hypervisor choice (Citrix's XenServer), and are trying out some new things.
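The zone capacity rule has a simple consequence worth spelling out: if one of three zones can vanish without anyone noticing, the other two have to absorb its load. A rough sketch of that arithmetic follows, assuming equally sized zones and load that spreads evenly across the survivors. The function name is made up for this illustration.

```python
def max_safe_utilization(zones, zones_lost=1):
    """Highest average utilization each zone can run at so that the
    surviving zones can still carry the full load."""
    if zones_lost >= zones:
        raise ValueError("at least one zone must survive")
    return (zones - zones_lost) / float(zones)


# Three zones, designed to survive the loss of any one of them.
print("run each zone at or below %.0f%%" % (100 * max_safe_utilization(3)))
```

With three zones that ceiling is about two thirds. In other words, roughly a third of the purchased capacity exists purely so that a failure or a maintenance window stays invisible.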

We've made good progress so far and as mentioned we're in the prototyping stage. I'm planning on blogging about a lot of the details as we go forward. What I'll say for now is that the bleeding edge can be painful, but it's definitely exciting.

Loading Templates with Django's app_directories.Loader

I don't really like having one monolithic template folder. You end up with a scenario where you have to go digging in a completely separate path for the template that goes with a given view, and that can be very disruptive when you're in the middle of working on something.

In my current project I wanted to use Django's app_directories loader to try to keep the templates physically close to the views that use them. Then I started working with it and I realized how annoying it'd be to use!

When I create a listing page I generally name the template file index.html or maybe listing.html depending on its usage. Edit pages are normally edit.html, details pages are normally details.html and so on. However, the app_directories loader class doesn't differentiate in any way between your different modules. This is the root of my issue.

In some magical world it'd use the source of the call to identify what was the most likely module, and then maybe do some guesses for likely fallbacks. But that's not really sane to try to implement, hell I don't even think it's really possible.

The official method is to build subdirectories under the application template folders. This means you'd have projectname/appname/templates/appname/index.html or projectname/appname/subappname/templates/appname/subappname/index.html. The other common 'fix' for this is to prefix your filenames so they'd all be appname-index.html or appname-subappname.html and the like.

I don't really care for either. The pathing solution ends up with a lot of redundancy, and you still have to maintain the kinda structure I'm wanting to avoid, and even if the structure is localized it's still annoying to have to go through. The prefixing just seems sloppy, and if you ever do accidentally have a conflicting name, which can happen, you could have a potentially hard to spot bug.

The best idea I've thought of so far to fix this is to have your render methods look like this: render_to_string(template_app, template_name, dictionary=None, context_instance=None). This way you just pass in the app name when you call it. Now I don't care for the inherent redundancy this would cause either, but it seems to me to be the lesser of the evils.
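To make the idea concrete, here is a minimal sketch of the lookup that passing an app name makes possible. It uses only the Python standard library rather than Django's loader API, and both the helper name and the assumption that each app keeps a "templates" folder next to its code are hypothetical, not taken from the actual implementation.

```python
import importlib
import os


def app_template_path(template_app, template_name):
    """Return the path of template_name inside template_app's own
    templates/ folder (hypothetical helper, for illustration only)."""
    module = importlib.import_module(template_app)
    # A package's __file__ points at its __init__ module, so the
    # directory containing it is the app directory.
    app_dir = os.path.dirname(os.path.abspath(module.__file__))
    return os.path.join(app_dir, "templates", template_name)


# Two stdlib packages stand in for two Django apps here.
print(app_template_path("json", "index.html"))
print(app_template_path("email", "index.html"))
```

Because the app name is part of the lookup, two apps can both ship an index.html without either one shadowing the other.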

The idea had me digging into the internals of how Django works with loading templates and what I needed to do to make this really work. What I found wasn't that bad. The Loader API is pretty simple; however, there were a few undocumented and non-obvious requirements. Also, you can't circumvent the Loader system completely as that breaks other facets of the framework. Some of this would have been extremely painful to figure out if PyCharm didn't have such a great debugger.

It's pretty straightforward to use.

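# Development: use the app-specific loader directly.
TEMPLATE_LOADERS = (('appname.common.templates.AppSpecificLoader',),)

# Production: wrap the same loader in Django's cached loader.
if not DEBUG:
    TEMPLATE_LOADERS = (
        ('django.template.loaders.cached.Loader',
            ('appname.common.templates.AppSpecificLoader',)),
    )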

Then you just need to import the render method and use it instead of the default. Alternatively, there is also a django-annoying inspired decorator that can be used to prevent the redundant code many of us have for building the response object.

Here is a link to the code.
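For a rough idea of what such a decorator looks like, here is a framework-free sketch. It is not the linked code: the name render_to is borrowed from django-annoying, and the render callable passed in is a stand-in assumption for whatever app-aware render function the project provides.

```python
import functools


def render_to(template_app, template_name, render):
    """Decorator sketch: a view returns a context dict, and the
    wrapper hands it to render(template_app, template_name, context)."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(request, *args, **kwargs):
            output = view(request, *args, **kwargs)
            if not isinstance(output, dict):
                # Redirects and other ready-made responses pass through.
                return output
            return render(template_app, template_name, output)
        return wrapper
    return decorator


def fake_render(template_app, template_name, context):
    # Stand-in for a real render function, just for this example.
    return "%s/%s with %r" % (template_app, template_name, context)


@render_to("news", "index.html", render=fake_render)
def index(request):
    return {"title": "Latest"}
```

The view body shrinks to building the context, and the app name and template name are stated once, right above the view that uses them.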