This isn't a DDoS, it's an Arms Race

I mentioned it in my last post, but we've been under almost constant threat of DDoS attacks for many months now. Up until a few years ago it wasn't a big deal. One or two guys would find a few machines to mess with us. The attacks were simplistic and came and went, and all in all weren't a huge problem. But as arms races go, things began to escalate. A new attack type would show up, we'd figure out how to stop it, and then we'd go back to our 'normal job' while we waited for the next one.

Now, about a year or so ago things really started to heat up. We upgraded our hardware load balancers to a pair of decked-out Brocade ADX 1000s (side note: the ADX platform is amazing; if people would like to see me write about why, hit me up on Twitter). We didn't do this as part of the arms race, but it had the added benefit of protecting against certain types of attacks, including SYN floods, and of enforcing per-client request rate limits. Honestly, it handles these pretty well and we'd have been sunk without them.
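For anyone who hasn't dealt with the request-limiting side of this: conceptually it's just per-client rate limiting, often a token bucket. Here's a minimal Python sketch of the idea, purely for illustration; it is not how the ADX implements it, and the numbers and names are made up.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum bucket size
        self.tokens = burst         # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP: 10 requests/second sustained, bursts of 20 (placeholder limits).
buckets = defaultdict(lambda: TokenBucket(rate=10, burst=20))

def should_serve(client_ip):
    return buckets[client_ip].allow()
```

The hardware does this at line rate across millions of packets; the point of the sketch is just what "excessive request limits" means in practice.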

Again, this just caused further escalation over the next several months as we went back and forth ad nauseam, and then in the last six months or so we started seeing larger and more sophisticated attacks. One group figured out it took us approximately five to seven minutes to identify and block them, so they started rotating IPs every four minutes or so.
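To make that five-to-seven-minute lag concrete, the identification step was essentially "find the IPs hammering us hardest in a recent window, then block them." Below is a rough sliding-window sketch of that kind of counting; the window size and threshold are assumptions for illustration, not our actual tooling.

```python
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds of history to keep (assumed)
THRESHOLD = 1000   # requests per window we'd consider abusive (assumed)

hits = defaultdict(deque)   # client_ip -> timestamps of recent requests

def record_request(client_ip, now=None):
    """Record one request; return True if this IP now looks abusive."""
    now = time.monotonic() if now is None else now
    q = hits[client_ip]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and q[0] < now - WINDOW:
        q.popleft()
    return len(q) > THRESHOLD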

My team and I have worked nights and weekends, sometimes several times a week, for months. For hours at a time we'd lose sleep and off time to these attacks and the cascades of failures that would follow (remember how I said we were reevaluating hypervisors?). We were all frazzled and never got a chance to really recover. We had largely come to the conclusion that there was very little left for us to do on our own. We needed to outsource our side of the conflict to a specialist.

Last week I went to our datacenter and we installed a pair of devices that specialize in stopping DDoS attacks before they ever reach the backend servers. Last Wednesday we turned them on, and since then they've identified nearly non-stop attacks that account for forty to sixty percent of all incoming packets. I expected to see ongoing attacks, but honestly this ratio is much higher than I would have thought.

So far the devices have been very effective. We've seen few to no false positives, and they truly were virtually plug-and-play, something no one on my team has ever seen in a device that can filter millions of packets a second in real time.

Most of what we see are standard things we had a handle on before, but we've also seen at least four sizable attacks in the last four days. Each was stopped in minutes instead of costing us hours of combined downtime and cleanup, but I can already see the other side escalating in response to just these four attacks.

One of the most mind-boggling things about DDoS attacks is the cost asymmetry. You can launch a successful attack by renting a few AWS servers for less than ten dollars an hour, or even renting a full-on botnet for twenty or so. A mitigation device like the one we're working with now carries a six-figure price tag, with tens of thousands of dollars in ongoing support costs.

I sincerely believe that every dollar is money well spent. Even ignoring the cost of an outage and lost revenue, or brand damage and shaken user loyalty, you can hardly put a price on the intangible costs of responding to these attacks, especially when you feel powerless to do much more than triage the damage.

Honestly though, between you and me, as great as I think these new devices are, I have my reservations. No matter how much money, time, and resources go into the creation of these magical boxes and their mystical algorithms, I expect it's only a matter of time before the combined mental fortitude of millions of angsty teenagers finds a fatal flaw. At least I can count on our new partner to escalate in turn, with specializations and expertise that I can barely fathom.

Monday was a revelation for me. For the first time in months I felt rested and didn't feel like I was defeated before the week started. It was honestly such a sweet feeling that it literally brought tears to my eyes. For the first time in a long while, as the war rages on in the background, I feel like I can sleep soundly.

We're building a Cloud… We're building it bigger

I've been with Curse for going on seven years now, and we've managed to do many wild and diverse things; it's been a great ride so far. I got to work on one of the largest Django websites on the internet (at the time), I worked on a desktop application used by two million people worldwide (I later wrote the Mac version from scratch), and for the last few years my main effort has been production-level IT work.

I want to make it clear: these aren't your corporate stooge, PC Load Letter style IT problems. We run one of the largest website networks in the country. Our monthly dynamic requests number in the billions, we've broken twenty-five million uniques, and we are continually reminded by our vendors and partners that the challenges we face are in no way standard. Despite all that, we knew we had truly arrived when we started getting significant daily DDoS attacks (and there was much rejoicing...).

In those seven years we've gone from fully managed dedicated servers, to colocated servers, to a private cage with some managed services, and now we're planning what the next iteration is going to look like. As much as I dislike buzzwords, it's going to look like a cloud.

We started virtualizing about two years ago in response to security and scalability concerns that only isolation could really fix. It's worked: we now run more than 800 VMs in production and are adding more all the time. The problem we ran into was that the hardware we had in place wasn't really meant for virtualization. Case in point: our LAN switch at the time. It had 192 ports, which was fine pre-virtualization. What we didn't realize is that its MAC address table only had room for about 200 entries. The virtualized environment grew, and we ended up with more than 500 MAC addresses in fairly short order. We decided to replace the old switch when we saw it was dropping more than 35 million packets a day on certain ports.
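If you want a rough sense of how close you are to a limit like that, the learned MACs are visible from the hosts themselves. Here's a small sketch assuming hosts with standard Linux bridging (`bridge fdb show` is the stock iproute2 command); it's a per-host view that includes the host's own and multicast entries, so it's an order-of-magnitude check rather than the switch's actual table.

```python
import subprocess

def learned_macs():
    """Return the set of MAC addresses in this host's bridge forwarding table."""
    # Each output line starts with a MAC address,
    # e.g. "52:54:00:ab:cd:ef dev vnet3 master br0".
    out = subprocess.run(["bridge", "fdb", "show"],
                         capture_output=True, text=True, check=True).stdout
    return {line.split()[0] for line in out.splitlines() if line.strip()}

if __name__ == "__main__":
    print(f"{len(learned_macs())} MAC addresses visible from this host")
```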

I point all of that out to illustrate that we didn't really know what we were getting into. We were growing organically and reactively, and that's a problem. It hasn't really stopped yet. Every time I've turned around for the last few years, that bare-metal hardware stack has been pushed to its limits and forced to grow in ways it was never meant to.

So here we are, going on three years later, with more than a few battle scars but a lot wiser for the wear. We started planning the new hosting more than six months ago, and during that time I've had one overriding mantra: planned, flexible growth is a must. Never grow organically or reactively. My second mantra has been: everything fails, so don't let anyone notice. I honestly don't know that I'll fully be able to achieve those lofty goals. Eventually something we couldn't possibly imagine will rear its ugly head and force me to react. I'll damn sure try to make it three years before that day comes, though.

So what am I building? Here are a few bullet points.

  • Three distinct availability zones. Each zone will have dedicated and isolated networking, compute and management servers, as well as its own SAN.
  • Capacity will be planned so that a whole zone can be taken offline for maintenance, or, in a more spectacular moment, can crash, without noticeably impacting uptime or performance (see the capacity sketch after this list).
  • Avoid 1Gbit networking wherever possible; prefer 10Gbit, which should last longer.
  • Avoid spinning disks. Use SSD storage for all critical applications.
  • Use the best hypervisor tech. That's a powder keg statement, but it's important. We're reevaluating our original hypervisor choice (Citrix XenServer) and are trying out some new things.
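The second bullet is really just N+1 capacity math: with three zones and the requirement that any one can drop out, the two survivors have to absorb peak load on their own. A quick back-of-the-envelope sketch, using placeholder numbers rather than our real figures:

```python
def per_zone_capacity(peak_load, zones=3, tolerated_failures=1):
    """Capacity each zone needs so the surviving zones can still carry peak load."""
    surviving = zones - tolerated_failures
    return peak_load / surviving

# Placeholder: 30,000 dynamic requests/sec at peak, three zones, one allowed to fail.
peak = 30_000
cap = per_zone_capacity(peak)
print(f"Each of 3 zones needs ~{cap:,.0f} req/s of capacity "
      f"(total capacity is {3 * cap / peak:.0%} of peak)")
```

In steady state that means no zone should be running much above two thirds of its own capacity, which is the headroom that lets us do maintenance without anyone noticing.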

We've made good progress so far and we're now into the prototyping stage. I'm planning to blog about a lot of the details as we go. What I'll say for now is that the bleeding edge can be painful, but it's definitely exciting.