The Azure SLA gets discussed quite a bit, but there's one point I often see causing confusion. The SLA for Azure compute instances states:
For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains, your internet facing roles will have external connectivity at least 99.95% of the time.
Some folks (for example, this post) incorrectly conclude that you need to deploy your solution across two or more datacenters to get this SLA. That's not the case – you just need to make sure the instances are in different fault and upgrade domains, which is typically done by default. You can think of a fault domain as a physical separation, such as a different rack: if there's a hardware failure on a server or switch, it only affects instances within the same fault domain. Upgrade domains are logical groupings that control how deployments are upgraded; for large deployments, you may have multiple upgrade domains so that all roles within an upgrade domain are upgraded as a group.
To illustrate this, I spun up 3 instances of Worldmaps running on my local Dev Fabric. I have an admin tool in the site that shows all current instances, their role, and their domain affiliation:
The admin page uses the RoleEnvironment class to check the status of the roles (more on this in another post), but also to display their fault and upgrade domains. (A value of “0f” is fault domain 0, “0u” is upgrade domain 0, and so on.) So by default, my three instances are in separate fault and upgrade domains that correspond to their instance number.
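The SLA condition is simple enough to express as a check. Here's a minimal Python sketch (Python standing in for the actual .NET code; the `covered_by_sla` helper and the (fault, upgrade) tuples are hypothetical, mirroring the admin page's “0f”/“0u” notation):

```python
def covered_by_sla(instances):
    """SLA eligibility check: two or more role instances, spread across
    different fault AND upgrade domains.

    `instances` is a list of (fault_domain, upgrade_domain) tuples.
    """
    fault_domains = {fd for fd, _ in instances}
    upgrade_domains = {ud for _, ud in instances}
    return (len(instances) >= 2
            and len(fault_domains) >= 2
            and len(upgrade_domains) >= 2)

# My three Dev Fabric instances: 0f/0u, 1f/1u, 2f/2u.
print(covered_by_sla([(0, 0), (1, 1), (2, 2)]))   # True
# Two instances sharing a fault domain would not qualify.
print(covered_by_sla([(0, 0), (0, 1)]))           # False
```

A single instance, or any set of instances that all share a fault domain or an upgrade domain, fails the check – which is exactly why the default placement behavior matters.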
All of these instances are in the same datacenter, and as long as I have at least 2 instances in different fault and upgrade domains (which is the default behavior), I'm covered by the SLA.
The principal advantage of keeping everything within the same datacenter is the bandwidth cost savings between roles, storage, and SQL Azure. Any bandwidth within the datacenter (for example, my webrole talking to SQL Azure or Azure Storage) incurs no bandwidth cost. If I move one of my roles to another datacenter, traffic between the datacenters is charged. Note, however, that there are still transaction costs for Azure Storage regardless of location.
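As a rough sketch of that cost model (the $/GB egress rate below is a placeholder assumption, not a quoted Azure price; `daily_storage_cost` is my own illustrative helper):

```python
# Rough model of the bandwidth/transaction trade-off described above.
# EGRESS_RATE_PER_GB is a placeholder assumption, not a quoted Azure price.
EGRESS_RATE_PER_GB = 0.15        # hypothetical cross-datacenter rate, $/GB
TXN_RATE = 0.01 / 10_000         # Azure Storage: $0.01 per 10,000 transactions

def daily_storage_cost(gb_transferred, transactions, same_datacenter):
    """Bandwidth is free within a datacenter; transactions are charged either way."""
    bandwidth = 0.0 if same_datacenter else gb_transferred * EGRESS_RATE_PER_GB
    return bandwidth + transactions * TXN_RATE

# Same datacenter: only transaction costs apply.
print(f"Same DC:  ${daily_storage_cost(10, 100_000, True):.2f}")   # Same DC:  $0.10
# Cross-datacenter: bandwidth charges kick in on top.
print(f"Cross DC: ${daily_storage_cost(10, 100_000, False):.2f}")  # Cross DC: $1.60
```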
This last fact brings up an interesting and potentially beneficial side effect. While I'm not trying to get into the scalability differences between Azure Table Storage and SQL Azure, from strictly a cost perspective it can be considerably more advantageous to go with SQL Azure in some scenarios. As I mentioned in my last post, Azure Storage transaction costs might creep up and surprise you if you aren't doing your math. If you're using Azure Table Storage for session and authentication information and have a medium-volume site (say, fewer than 10 webroles – but that's just an off-the-cuff number; it really depends on what your application is doing), SQL Azure represents a fixed cost, whereas Table Storage costs vary with traffic to your site.
For example, a small SQL Azure instance at $9.99/month works out to about $0.33/day. Azure Table transactions are $0.01 per 10,000. If each hit to your site made only one transaction to storage, you could serve roughly 330,000 hits per day for the same cost. Any more, and SQL Azure becomes more attractive, albeit with less scalability. In many cases you wouldn't need to go to table storage on every hit – but then again, you might make several transactions per hit, depending on what you're doing. This is why profiling your application is important.
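That break-even arithmetic can be checked in a few lines (a sketch using the rounded $0.33/day figure from above; `txns_per_hit` is a knob I've added to model the several-transactions-per-hit case):

```python
# Break-even between a small SQL Azure instance and Azure Table Storage,
# using the prices quoted above ($9.99/month, rounded to $0.33/day,
# versus $0.01 per 10,000 transactions).
SQL_AZURE_PER_DAY = 0.33
TABLE_COST_PER_TXN = 0.01 / 10_000

def break_even_hits_per_day(txns_per_hit=1):
    """Daily hits at which Table Storage costs match the SQL Azure flat fee."""
    return SQL_AZURE_PER_DAY / (TABLE_COST_PER_TXN * txns_per_hit)

print(f"{break_even_hits_per_day():,.0f} hits/day")                # 330,000 hits/day
# If each hit makes 3 storage transactions, the break-even drops:
print(f"{break_even_hits_per_day(txns_per_hit=3):,.0f} hits/day")  # 110,000 hits/day
```

Below the break-even traffic level, Table Storage is cheaper; above it, the flat SQL Azure fee wins on cost – which is the profiling point made above.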