Why We Moved Off The Cloud
Cloud computing is often positioned as a solution to scalability problems. In fact, it seems like almost every day I read a blog post about a company moving infrastructure to the cloud. At Mixpanel, we did the opposite. I’m writing this post to explain why and maybe even encourage some other startups to consider the alternative.
First though, I wanted to write a short bit about the advantages of cloud servers since they are ideal for some use cases.
- Low initial costs. Specifically, you can get a cloud server for less than $20. Even the cheapest dedicated servers (and I wouldn’t recommend the cheapest) will cost more than $50. For new companies, this can make a difference.
- Fast deployment times and hourly billing. If you have variable traffic and you’re not having problems scaling your data persistence layer, you can fairly easily spin servers up and down in response to usage patterns. It’s worth pointing out that I specifically mean variable traffic rather than growing traffic. Purely from an ease-of-deployment standpoint, handling even quickly growing traffic is fairly easy on both cloud and dedicated platforms.
- Cheap CPU performance. If your application is purely CPU bound, then you can end up with great price/performance ratios. Most cloud servers allow a single small node on a physical server to use more than its fair share of CPU resources if they are otherwise underutilized (and they often are). One of the last bits of our infrastructure still on the cloud is CPU bound, and even though we pay for very small Rackspace cloud servers, we get the performance of dedicated hardware.
The cloud’s intractable problem
… is variable (no, highly variable) performance. We’ve spent a lot of effort designing our infrastructure to scale horizontally, so poor performance is not much of a problem; it just means buying more machines. However, highly variable performance is incredibly hard to code or design around (think of a server that normally does 300 queries per second with low I/O wait suddenly dropping to 50 queries per second at 100% disk utilization for literally hours). It’s solvable, certainly, but only with lots of time and money, and it’s hard to justify the cost when there’s a better alternative available.
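To put rough numbers on why steady-but-slow is manageable while variable is not, here is a minimal sketch. The per-machine rates are the 300 and 50 queries per second from the example above; the total load of 3,000 queries per second is a made-up figure for illustration, not our real traffic, and the function name is just something I picked.

```python
import math


def machines_needed(total_qps: float, per_machine_qps: float) -> int:
    """Smallest number of machines that can serve total_qps."""
    return math.ceil(total_qps / per_machine_qps)


# Hypothetical cluster-wide load (an assumption for this example).
TOTAL_QPS = 3000
# Per-machine rates taken from the example in the text.
NORMAL_QPS = 300    # a healthy server
DEGRADED_QPS = 50   # the same server during a bad stretch

steady = machines_needed(TOTAL_QPS, NORMAL_QPS)
worst_case = machines_needed(TOTAL_QPS, DEGRADED_QPS)
print(steady, worst_case)  # 10 60
```

If every machine were simply slow all the time, you would buy the larger number once and be done. The trouble is that you can’t predict which machines will degrade or when, so you either pay for the worst case across the board or accept that some hours will be bad.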
The fundamental problem with cloud servers is that you’re at the mercy of your neighbors. If they decide to “dd if=/dev/zero of=/dev/sda”, there’s not a whole lot you can do about it other than migrating to a different physical server (and it’s really hard to decide whether to wait it out or migrate, especially because zero-downtime migrations are always a little painful). Even worse, at Mixpanel’s level of disk usage that migration can easily take more than a day. In other words, you’d better hope that your neighbor never runs that command or anything that even looks like it from a disk utilization perspective. Side note: based on observations over a few months, I’m pretty sure that Rackspace actually does the equivalent of a full virtual disk wipe every time a customer deletes a cloud server. Better hope that none of your neighbors ever decides to cancel their server!
To be clear, variable performance is what forced us off the cloud, but I thought I would point out a couple of other cloud disadvantages too:
- One size fits all. The cloud, even AWS, offers very little customization compared to dedicated hardware. We recently added a new backup machine with a crappy CPU, little RAM, and 24 2TB drives in a hardware RAID 6 configuration. You can’t get that from a cloud provider and if you find something similar it’s going to cost an order of magnitude more than what we’re paying.
- No access to bleeding edge hardware. At Mixpanel, some of our codebase is highly optimized low-level C. We’ve profiled, tweaked, and made sure we’re not missing anything obvious. My point is that the only way this code is going to run in less time is if we get faster hardware. Dedicated hosting providers usually stay on top of new hardware (specifically, the latest CPUs and SSDs). On the cloud, you’re usually stuck with whatever the provider got a volume discount on.
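For a sense of scale on the backup machine mentioned above: RAID 6 spends two drives’ worth of space on parity, so usable capacity is roughly (number of drives - 2) times the drive size. A quick sketch, assuming all 24 drives are in a single array with no hot spares (the post doesn’t say either way) and ignoring filesystem overhead:

```python
def raid6_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Approximate usable capacity of one RAID 6 array, in TB.

    RAID 6 keeps two drives' worth of parity, so it needs at least
    four drives and loses two of them to redundancy.
    """
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drive_count - 2) * drive_size_tb


print(raid6_usable_tb(24, 2.0))  # 44.0
```

That works out to about 44 TB of raw usable space in one box, which is the kind of configuration that is hard to find on a cloud price sheet.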
After getting fed up with variable cloud performance, I decided to make the move to dedicated hardware. This isn’t a decision to take lightly. It literally took months to move the most important parts of our infrastructure. Starting migrations at 8 p.m. on a Friday and waking up early Saturday morning to finish them off isn’t so much fun either.
After deciding to go dedicated, the next step is choosing a provider. We got competing quotes from a number of companies. One thing that I was surprised by (and this really doesn’t seem to be the case with the cloud) is that pricing is highly variable and you have to be prepared to negotiate everything. The difference between ordering at face value and either getting a competing quote or simply negotiating down can be as much as 50-75% off. As an engineer, I find this type of sales process tiring, but once you have a good feel for what you should be paying and what kind of discount you can reasonably get, the negotiations are pretty quick and painless.
We ultimately decided to go with Softlayer for a number of reasons:
- No contracts. I don’t think I really need to explain the advantage. You would think that you could get better prices by signing 1- or 2-year contracts, but interestingly enough, out of the initial 5 providers we talked to, the two that didn’t require contracts had the best prices.
- Wide selection. Softlayer seems to keep machines around for a while and you can get very good deals on last year’s hardware. Most of the other providers we contacted would only provision brand new hardware and you pay a premium.
- Fast deployment. Softlayer isn’t quite at the cloud level for deployment times, but we usually get machines within 2-8 hours or so. That’s good enough for our purposes. On the other hand, a lot of other hosting companies have deployment times measured in days or worse.
One last thing about getting dedicated hardware. It’s cheaper… a lot cheaper. We have machines that give us 2-4x performance that cost less than half as much as their cloud equivalents and we’re not even co-locating (which has its own set of hassles).
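To turn those two figures into a single price/performance number, here is a small back-of-the-envelope sketch. It uses only the ranges quoted above (2-4x the performance, less than half the cost); since “less than half” is an upper bound on cost, the results are lower bounds on the gain. The function name is just for illustration.

```python
def perf_per_dollar_gain(perf_multiple: float, cost_multiple: float) -> float:
    """How many times more performance each dollar buys.

    perf_multiple: dedicated performance relative to cloud (e.g. 2.0).
    cost_multiple: dedicated cost relative to cloud (e.g. 0.5 for half).
    """
    if cost_multiple <= 0:
        raise ValueError("cost_multiple must be positive")
    return perf_multiple / cost_multiple


# Cost is "less than half", so 0.5 is the most pessimistic value.
low = perf_per_dollar_gain(2, 0.5)
high = perf_per_dollar_gain(4, 0.5)
print(low, high)  # 4.0 8.0
```

In other words, each dollar buys at least 4 to 8 times as much performance on those machines, before counting any discount you negotiate.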
We’ve moved 100% of our machines that rely upon performant disks to dedicated servers hosted at Softlayer. Roughly speaking, this corresponds to about 80% of our hosting costs. Eventually, we’ll move everything, both for ease of management and for bandwidth savings (a lot of our traffic could be internal to a datacenter if all of our machines were hosted in the same place).
Since I started this migration, our traffic has grown more than ten-fold. At the same time, our infrastructure has gotten significantly faster, more reliable, and, interestingly enough, cheaper (at the per-machine level). Most importantly, the amount of time I’ve spent fixing server issues late at night or on weekends has decreased to almost nothing.
I hope this post has convinced at least one growing startup to consider dedicated hardware a lot sooner than we did. Honestly, as soon as you first start to see issues with inconsistent or poor disk performance, you should probably move. It will save you a lot of late nights, development time, and grief.