Backing Up My Computer: A Strategy

I've lost data.

I think most of us have lost data at some point, and I want to help people avoid it going forward. So to start off, there are a few things to note about data backups.

You should have multiple backups. Backups can be out of date, corrupt, or otherwise unusable. It's also possible backups can be lost at the same time as the source data.

This is also why 'RAID is not backup' is a popular phrase in some circles. A RAID array can seem like a backup, but it's possible (even likely!) that you'll lose multiple disks in the array in quick succession. Without multiple backups there's a good chance you'll lose everything anyway.

You should have local and remote backups. If all your backups are local, you're still vulnerable to data loss in the event of disasters like theft or fire. If all your backups are remote, both backing up data and restoring it are slow.

Slow data restores are annoying, but manageable.

However, when backing up there is always an unrecoverable window of changes that haven't been backed up yet. The longer this window, the more data you can lose when your primary storage goes down. Local backups allow this window to be minimized.

Backups are not backups unless they're restorable. The first time I backed up a machine I was 12 years old. This may sound like some great statement of accomplishment, but it's not.

While exploring the capabilities of the system I deleted everything on the hard drive (I was only 12, after all). Backup in hand, I felt confident, but unfortunately I had no idea how to restore it, or even which application I had used to create it.

This is just one way backups may be useless. Backups can be corrupted or messed up in countless ways. It's important to use methodologies that are tried and true. Some common backup utilities have horrific reliability records.

You should also test backups periodically. You should, but realistically I don't expect people to actually do this. That low likelihood of testing is why I rate having multiple backups and choosing a proven methodology as the more important decisions.

My Data

I use a Mac as my primary machine, and the rest of this post is going to somewhat reflect that. Some of the advice I give will apply if you have Windows, some won't. Regardless, the spirit of what I do is the important part.

My data is divided into two major parts. First, I have a primary drive that houses the operating system and the majority of my applications and games. For me this is a 1TB Fusion Drive.

Second, I have a media drive that houses my iTunes library and other similarly massive data sets. This lives in an external Thunderbolt 2 enclosure on a set of four 4TB drives in a RAID 10 array, which gives me roughly 8TB usable.

My Backup Strategy

I actually have two separate strategies.

The Boot Drive

This is my critical data: documents, applications, configurations, my work, my life. I keep three different backups of this data.

First, I have a 3TB external drive that I use as a target for Time Machine, the native OS X backup solution. It is powerful, reliable, multi-purpose, and revisioned; it keeps old versions of files around for years if space is available. This tier addresses several of the problems stated above: it's local, has a known format, is easy to recover from, and keeps the unrecoverable failure window to (roughly) an hour or less.
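
If you want to keep an eye on that window, here's a rough Python sketch of the kind of check I mean. It's not part of my actual setup; it just shells out to tmutil and assumes the latest snapshot path ends in the usual YYYY-MM-DD-HHMMSS timestamp.

    import re
    import subprocess
    from datetime import datetime

    def latest_backup_age_hours():
        # 'tmutil latestbackup' prints the path of the newest Time Machine snapshot.
        path = subprocess.check_output(["tmutil", "latestbackup"], text=True).strip()
        # Assumption: the snapshot path ends in a YYYY-MM-DD-HHMMSS timestamp.
        match = re.search(r"(\d{4}-\d{2}-\d{2})-(\d{6})", path)
        if not match:
            raise ValueError("No timestamp found in: %s" % path)
        stamp = datetime.strptime("-".join(match.groups()), "%Y-%m-%d-%H%M%S")
        return (datetime.now() - stamp).total_seconds() / 3600.0

    if __name__ == "__main__":
        age = latest_backup_age_hours()
        print("Latest Time Machine backup is %.1f hours old" % age)
        if age > 2:
            print("Warning: the unrecoverable window is larger than expected.")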

Second, I have a 1TB external drive that I use with SuperDuper. It's a funny name, but it does something pretty incredible: it creates a bootable clone of my primary drive. I can, and have, booted into the clone and copied it back to the main drive. This addresses most of the same problems as the Time Machine backup, but the failure window is a day or so. Losing a day of work could be devastating, and way more expensive than an extra external drive.

Third, I have Backblaze. I use it for both drives, and I'll say more about it below.

The Media Drive

This data I frankly consider less critical. I can restore just about everything on here without any backups given enough time, but it's a major hassle to get it all back and straight again. Consequently I have less protection on this data.

As mentioned, this data is stored on a RAID 10 array. The RAID is my first tier of backups for this drive. I know that 'RAID is not a backup', but when used correctly it actually can be a good first step.

I could write a whole post about RAID setups and how to use them effectively. The short story is that RAID 0 is horrific for data reliability, RAID 5 and RAID 1 have rebuild issues once disks get bigger than about a terabyte, and RAID 6 and RAID 10 are pretty good. RAID 6 wastes less space, but only past four disks. This is why I used RAID 10, although honestly RAID 6 may have been better from a fault tolerance standpoint.
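
To put some rough numbers behind that rebuild concern, here's a back-of-the-envelope Python sketch. It assumes the commonly quoted consumer-drive unrecoverable read error rate of one per 10^14 bits; real drives vary, so treat these as illustrative odds, not a statement about my specific disks.

    import math

    URE_RATE = 1e-14      # assumed: one unrecoverable read error per 10^14 bits
    BITS_PER_TB = 8e12    # 1 TB = 8 * 10^12 bits (decimal terabytes)

    def p_rebuild_error(tb_read, ure_rate=URE_RATE):
        # Probability of at least one URE while reading tb_read terabytes,
        # assuming independent errors at a fixed rate. Numerically stable
        # equivalent of 1 - (1 - ure_rate) ** bits.
        bits = tb_read * BITS_PER_TB
        return -math.expm1(bits * math.log1p(-ure_rate))

    # RAID 5 with four 4TB disks: a rebuild has to read the three surviving disks.
    print("RAID 5 rebuild, 12 TB read: %.0f%% chance of an error" % (100 * p_rebuild_error(12)))
    # RAID 1/10: rebuilding a failed disk only re-reads its mirror partner.
    print("Mirror rebuild, 4 TB read:  %.0f%% chance of an error" % (100 * p_rebuild_error(4)))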

As mentioned earlier, I use Backblaze as my second-tier backup for this drive.

My Recommendations for You

My setup is probably too complex for most users, especially those who may not be the most tech savvy. So here are a few basic recommendations to take away from this ramble.

First, sign up for and use a cloud-based backup. I've used several, and obviously I recommend Backblaze.

  • It's affordable at $5 a month per computer. There are no per-gigabyte storage costs.
  • It's native on multiple platforms; heavyweight cross-platform apps bother me.
  • Uploads are not throttled on Backblaze's end.
  • Restores are straightforward downloads.
  • If you need 'fast' recovery of a ton of data they can ship you a drive (for the cost of the drive).

Full disclosure: the Backblaze links I've been including here are referral links. If you use them, I get a free month. You don't have to, but it'd be appreciated.

Second, use a local backup too. On a Mac, Time Machine and SuperDuper are both great options. On Windows there's a built-in option, but discussions with many people tell me it has questionable reliability. There are a lot of other off-the-shelf options, but I'm not sure which to recommend. Having one at all is more important than having the best one. Don't let analysis paralysis stop you.

I hope this post helps at least one person avoid losing data whether it's your saved games, hours of development, or your kid's birthday photos.

Architecting in the Cloud -- Part 1

In mid-2012 I started preparing for a large event that was coming up for our company. It had been almost two years since we built our hosting infrastructure, and in a little over a year those leases would start expiring. We'd need to choose between keeping what we had or getting new hardware. (Spoiler: I picked the latter, but not without good reason.)

A little history first. When our current platform was designed and implemented it wasn't virtualized at all. It was 100% bare metal, and that was ingrained into some of the base decisions. Over the course of its lifetime we needed to start virtualizing our environment, and now we run over seven hundred virtual machines on top of an infrastructure that was never meant to be virtualized.

This technical shoehorning has led to a lot of interesting challenges, which we've worked hard to eliminate. Last month we set an internal record and hit 99.9998% uptime on that patchwork infrastructure. So I believe it's safe to say we've overcome the core challenges, but that doesn't mean we can just carry on as we have been. These are some of the factors:

  • Our hardware has been declared end of life by our vendor.
  • We've been hitting low-level limitations of our current hypervisor.
  • There's a major and very expensive growth point on the horizon.

Given these, I wanted to redesign our hosting from the ground up. I wanted an infrastructure designed to be highly redundant and virtualized from day one. I wanted a cloud, but not a public one.

Why build a private Cloud?

This is honestly a good question. I actually think that the concept of the cloud is way oversold by many vendors. It's not a panacea; it's not going to fix all of your woes. Honestly, I'd be surprised if it didn't end up causing some new ones. Even if it's generic, like AWS, you still have to design around that platform's deficiencies and caveats. If it's less generic, like Google App Engine, or even certain facets of Windows Azure, you can be completely married to the cloud in question, and that could lead to an extremely painful and costly divorce.

I've also found that it is not universally true that you can save money by using the cloud. Compared to what I'm building, AWS costs an order of magnitude more for the same capacity with unreserved instances and 3-4 times more on reserved instances. Bandwidth through AWS would cost me roughly 6 times more as well.

There is a definite size threshold past which salaries and other intangible costs are easily less than those differences, and I believe Curse's network of sites is well past it. I've got enough going around in my head about the cloud in general that I'll probably end up writing another post in the future that goes deeper into my thoughts on the subject.

I've gone on about all that to get to a point though. As a company, our bandwidth and compute needs don't fit well, or economically, into any of the public cloud offerings. Could we do it? Yes, absolutely. Would it save us money? Nope, we'd actually spend a lot more.

Virtualization, however, brings a lot that's very useful for us. We run a lot of different sites. Sites of all shapes, and colors, and uses, and demands. Using bare metal for all of these sites would not be even remotely cost effective. I'd need hundreds of physical servers to reach the same level of redundancy and stability, and huge portions of them would sit there idle most if not all of the time. I believe the general benefits of virtualization are fairly well established, so I'll stop there.

So public clouds are not suitable, but we need virtualization. The only reasonable solution I could find given those two things was to build a robust private cloud that we can grow on top of.

Design goals

In designing this magical private cloud I set out with a few major goals and consideration points:

  • Achievable "five nines" uptime. Our official SLA is only 99.95%, but I have a personal 99.999% goal that I hope to routinely hit by the end of the year (see the quick downtime math after this list).
  • Avoiding hardware-maintenance-related downtime. Currently a firmware upgrade on our SAN results in 30+ minutes of downtime.
  • Performance. Our applications can become extremely bottlenecked on CPU and/or disk IO. Those bottlenecks must be mitigated as much as possible.
  • Planned growth. Organic growth means the death of an otherwise good system. Each major component of the infrastructure must have a planned, non-destructive growth point.
  • Reevaluate hypervisor selection. As mentioned, we have been hitting fundamental bottlenecks in our current hypervisor.
  • Backups and disaster scenario response. There's no such thing as foolproof.
  • Keep costs reasonable. This is a balancing act with the other goals, as performance and reliability often bring additional costs.
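
Here's the quick downtime math referenced above: a tiny Python sketch that turns an availability target into an allowed downtime budget. It's plain arithmetic, nothing specific to our infrastructure.

    # Convert an availability target into an allowed downtime budget per year.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for target in (0.9995, 0.9999, 0.99999, 0.999998):
        allowed = (1 - target) * MINUTES_PER_YEAR
        print("%.4f%% uptime allows roughly %6.1f minutes of downtime per year"
              % (target * 100, allowed))

    # 99.9500% -> ~263 minutes (our 99.95% SLA, about 4.4 hours)
    # 99.9990% -> ~5.3 minutes ("five nines")
    # 99.9998% -> ~1.1 minutes (last month's record, annualized)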

All of these goals I'm going to cover in more depth in future articles. I'll be posting the next part of this soon where I'll go over my thoughts on High Availability and Redundancy.

I invite anyone who's curious for more info or has questions to please engage with me on Twitter. I love geeking out about this kinda thing and it's a lot of fun to talk about.

This isn't a DDoS, it's an Arms Race

I mentioned it in my last post, but we've been under almost constant threat of DDoS attacks for the past several months. Up until a few years ago it wasn't a big deal. One or two guys would find a few machines to mess with us. The attacks were simplistic and came and went, and all in all weren't a huge problem. But as arms races go, things began to escalate. A new attack type would show up, we'd figure out how to stop it, and then we'd do our 'normal job' while we waited.

Now, about a year or so ago things really started to heat up. We upgraded our hardware load balancers to a pair of decked-out Brocade ADX 1000s (side note: the ADX platform is amazing; if people would like to see me write about why, hit me up on Twitter). We didn't do this as part of the arms race, but it had the added benefit of being able to protect against certain types of attacks, including SYN floods and excessive request rates. Honestly, it handles these pretty well, and we'd have been sunk without them.
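
I'm not going to reproduce load balancer configuration here, but for anyone wondering what 'excessive request rates' means mechanically, here's a toy token-bucket rate limiter in Python. It's only an illustration of the general technique, not what the ADXs actually run.

    import time
    from collections import defaultdict

    RATE = 10.0    # sustained requests per second allowed per client
    BURST = 20.0   # bucket size: short bursts allowed above the sustained rate

    _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow_request(client_ip):
        # Refill the client's bucket based on elapsed time, then spend one token.
        bucket = _buckets[client_ip]
        now = time.monotonic()
        bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
        bucket["last"] = now
        if bucket["tokens"] >= 1.0:
            bucket["tokens"] -= 1.0
            return True
        return False

    # A client hammering the front end burns through its burst and gets rejected.
    if __name__ == "__main__":
        decisions = [allow_request("203.0.113.7") for _ in range(30)]
        print(decisions.count(True), "allowed,", decisions.count(False), "rejected")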

Again, this just caused further escalation over the next several months as we went back and forth ad nauseam, and then in the last six months or so we started seeing larger and more sophisticated methods. One set of attackers figured out it took us approximately five to seven minutes to identify and block them, so they started rotating IPs every four minutes or so.

My team and I have worked nights and weekends, sometimes several times a week, for months. For hours at a time we'd lose sleep and off time to these attacks and the cascades of failures that would follow (remember how I said we were reevaluating hypervisors?). We were all frazzled and couldn't get a chance to really recover. Largely we had come to the conclusion that there was very little left for us to do; we needed to outsource our side of the conflict to a specialist.

Last week I went to our datacenter and we installed a pair of devices specialized in stopping DDoS attacks before they ever reach the backend servers. We turned them on last Wednesday, and since then they've identified almost non-stop attacks that account for forty to sixty percent of all incoming packets. I expected to see ongoing attacks, but honestly this ratio is much higher than I would have thought.

So far the devices have been very effective. We've seen few to no false positives, and they truly were virtually plug-and-play, something no one on my team has ever seen with a device that can filter millions of packets a second in real time.

Most of what we see are standard things we had a handle on before, but we've also seen at least four sizable attacks in the last four days. These were stopped in minutes instead of costing hours of combined downtime and cleanup effort, but I can already see the other side escalating just over these four attacks.

One of the most mind-boggling things about DDoS attacks is the cost asymmetry. You can launch a successful attack by renting a few AWS servers for less than ten dollars an hour, or even renting a full-on botnet for twenty or so. A mitigation device like the ones we're working with now has a price tag in the six figures, with tens of thousands in ongoing support costs.

I sincerely believe that every dollar is money well spent. Even ignoring the cost of an outage and lost revenue, or brand damage and shaken user loyalty, you can hardly put a label on the intangible costs associated with responding to these attacks, especially when you feel powerless to do much more than triage the damage.

Honestly though, between you and me, as great as I think these new devices are, I have my reservations. No matter how much money, time, and resources go into the creation of these magical boxes and their mystical algorithms, I expect it's only a matter of time before the combined mental fortitude of millions of angsty teenagers finds a fatal flaw. At least I can count on our new partner to escalate in turn, with specializations and expertise that I can barely fathom.

Monday was a revelation for me. For the first time in months I felt rested and didn't feel like I was defeated before the week started. It was honestly such a sweet feeling that it literally brought tears to my eyes. For the first time in a long while, as the war rages on in the background, I feel like I can sleep soundly.

We're building a Cloud… We're building it bigger

I've been with Curse for going on seven years, and we've managed to do many wild and diverse things; it's been a great ride so far. I've gotten to work on one of the largest Django websites on the internet (at the time), I worked on a desktop application that is in use by two million people worldwide (I later wrote the Mac version from scratch), and for the last few years my main effort has been production-level IT work.

I want to make it clear: these aren't your corporate-stooge, PC Load Letter-style IT problems. We run one of the largest website networks in the country. Our monthly dynamic requests number in the billions, we've broken twenty-five million uniques, and we are continually reminded by our vendors and partners that the challenges we face are in no way standard. Despite all that, we truly knew we had arrived when we started having significant daily DDoS attacks (and there was much rejoicing...).

In those last seven years we've gone from fully managed dedicated servers, to colocated servers, to a private cage with some managed services, and now we're planning what the next iteration is going to look like. As much as I dislike buzzwords, it's going to look like a cloud.

We started virtualizing about two years ago in response to security and scalability concerns that only isolation could really fix. It's worked; we now run more than 800 VMs in production, adding more all the time. The problem we ran into was that the hardware we had in place wasn't really meant for virtualization. Case in point: our LAN switch at the time. It had 192 ports, and was fine pre-virtualization. What we didn't realize is that its MAC address table only had room for about 200 entries. The new virtualized environment grew, and we ended up with more than 500 MAC addresses in fairly short order. We decided to replace the old switch when we saw it was dropping more than 35 million packets a day on certain ports.

I point all of that out to illustrate that we didn't really know what we were getting into. We were growing organically and reactively, and that's a problem. It hasn't really stopped yet. Every time I've turned around for the last few years that bare metal hardware stack has been pushed to the limits and forced to grow in ways it was never meant to.

So here we are going on three years later, with more than a few battle scars but a lot wiser for the wear. We started planning for the new hosting more than six months ago, and during that time I've had one overriding mantra: planned, flexible growth is a must. Never grow organically or reactively. My second mantra has been: everything fails; don't let anyone notice. I honestly don't know that I'll fully be able to achieve those lofty goals. Eventually something we couldn't possibly imagine will rear its ugly head and force me to react. I'll damn sure try to make it three years before that day comes, though.

So what am I building? Here's a few bullet points.

  • Three distinct availability zones. Each zone will have dedicated and isolated networking, compute and management servers, as well as its own SAN.
  • Capacity will be planned so that a whole zone can be taken offline for maintenance, or, in a more spectacular moment, can crash without noticeably impacting uptime or performance (see the quick capacity math after this list).
  • Avoid 1Gbit networking wherever possible; prefer 10Gbit instead, as it should last longer.
  • Avoid spinning disks. Use SSD storage for all critical applications.
  • Use the best hypervisor tech. That's a powder-keg statement, but it's important. We're reevaluating our original hypervisor choice (Citrix's XenServer) and are trying out some new things.
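
Here's the quick capacity math referenced in the second bullet, as a small Python sketch. The peak load number is made up purely for illustration.

    # If any one of N zones can fail, the surviving N-1 must carry the full peak
    # load, which caps how hot each zone can run day to day.
    def max_normal_utilization(zones):
        return (zones - 1) / zones

    def required_total_capacity(peak_load, zones):
        return peak_load * zones / (zones - 1)

    ZONES = 3
    PEAK_LOAD = 120_000  # requests/sec at peak; a made-up number for illustration

    print("Max steady-state utilization per zone: %.0f%%"
          % (100 * max_normal_utilization(ZONES)))
    print("Total capacity to provision: %d req/s"
          % required_total_capacity(PEAK_LOAD, ZONES))
    # With three zones: run each at no more than ~67% so the remaining two can
    # absorb a full zone outage without users noticing.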

We've made good progress so far and as mentioned we're in the prototyping stage. I'm planning on blogging about a lot of the details as we go forward. What I'll say for now is that the bleeding edge can be painful, but it's definitely exciting.