The Azure SLA gets discussed quite a bit, but there's one point I see causing confusion. The SLA for Azure compute instances states:

"For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains, your internet facing roles will have external connectivity at least 99.95% of the time."

Some folks (for example, this post) incorrectly conclude that you need to deploy your solution across two or more datacenters to get this SLA. That's not true – you just need to make sure your instances are in different fault and upgrade domains, which is typically done by default.

You can think of a fault domain as a physical separation – a different rack – so if there's a hardware failure on the server or switch, it only affects instances within the same fault domain. Upgrade domains are logical groupings that control how deployments are upgraded. For large deployments, you may have multiple upgrade domains so that all roles within an upgrade domain are upgraded as a group.

To illustrate this, I spun up 3 instances of Worldmaps running on my local Dev Fabric. I have an admin tool in the site that shows all current instances, their role, and their domain affiliation: The admin page uses the RoleEnvironment class to check the status of the roles (more on this in another post), but also displays their fault and upgrade domains. (A value of "0f" is fault domain 0, "0u" is upgrade domain 0, and so on.) So by default, my three instances are in separate fault and upgrade domains that correspond to their instance number.

All of these instances are in the same datacenter, and as long as I have at least 2 instances in different fault and upgrade domains (which is the default behavior), I'm covered by the SLA. The principal advantage of keeping everything within the same datacenter is cost savings between roles, storage, and SQL Azure.
Essentially, any bandwidth within the datacenter (for example, my webrole talking to SQL Azure or Azure Storage) incurs no bandwidth cost. If I move one of my roles to another datacenter, traffic between datacenters is charged. Note, however, that there are still transaction costs for Azure Storage.

That last fact brings up an interesting and potentially beneficial side effect. Without getting into the scalability differences between Azure Table Storage and SQL Azure, from strictly a cost perspective it can be considerably more advantageous to go with SQL Azure in some cases. As I mentioned in my last post, Azure Storage transaction costs can creep up and surprise you if you aren't doing your math. If you're using Azure Table Storage for session and authentication information and have a medium-volume site (say, fewer than 10 webroles – an off-the-cuff number; it really depends on what your applications are doing), SQL Azure represents a fixed cost, whereas Table Storage cost will vary with traffic to your site.

For example, a small SQL Azure instance at $9.99/month works out to about $0.33/day. Azure Table transactions are $0.01 per 10,000. If each hit to your site made only one transaction to storage, you could serve roughly 330,000 hits per day before matching that cost. Any more, and SQL Azure becomes more attractive, albeit with less scalability. In many cases you wouldn't need to go to table storage on every hit – but then again, you might make several transactions per hit, depending on what you're doing. This is why profiling your application is important. More soon!
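The break-even arithmetic above is easy to sketch. Here's a minimal calculation using only the figures quoted in the post ($9.99/month for a small SQL Azure instance, $0.01 per 10,000 storage transactions, one transaction per hit); actual pricing has changed since, so treat the numbers as illustrative:

```python
# Break-even between a fixed-cost SQL Azure instance and pay-per-transaction
# Azure Table Storage, using the prices quoted in the post.
SQL_AZURE_MONTHLY = 9.99                   # small SQL Azure instance, $/month
SQL_AZURE_DAILY = SQL_AZURE_MONTHLY / 30   # roughly $0.33/day
COST_PER_TRANSACTION = 0.01 / 10_000       # $0.01 per 10,000 transactions

def table_storage_daily_cost(hits_per_day, transactions_per_hit=1):
    """Daily Table Storage transaction cost for a given traffic level."""
    return hits_per_day * transactions_per_hit * COST_PER_TRANSACTION

# Traffic level at which Table Storage costs as much as SQL Azure
break_even_hits = SQL_AZURE_DAILY / COST_PER_TRANSACTION
print(f"break-even: ~{break_even_hits:,.0f} hits/day")  # roughly 333,000
```

Bump `transactions_per_hit` to 3 or 4 (common once sessions and auth both touch storage) and the break-even traffic drops proportionally, which is the point about profiling.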
The latest Windows Azure Tools for Visual Studio (v1.1, Feb 2010) was just released and can be downloaded here. It does not support VS2010 Beta 2, so you'll need to either use VS2008 or wait for the VS2010 RC in a few weeks. I'm really excited about this release for one immediate to-do in my code, and one would-be-fun to-do.

First, the Windows Azure Drive beta is available in this release (called XDrive at PDC). Windows Azure Drive lets you mount an NTFS virtual hard disk (VHD) across one or more roles (only one instance can mount the drive for writing). The drive itself is stored in Azure Blob Storage, but behind the scenes there are some nice features (like caching) that make it a great option for Azure storage, particularly if you're migrating an application that made extensive use of direct disk I/O.

Now, since the November release, Queue storage has included a dequeue count property that gives you visibility into how many times a message has been dequeued. But the StorageClient included with the VS tools didn't expose this property, so until now you'd have to roll your own implementation to get the value. Seeing the dequeue count is pivotal in dealing with poison messages. A poison message, in queue parlance, is a message containing malformed data that ultimately causes the queue processor to throw an exception. The result is that the message isn't processed, stays in the queue, and is picked up again – with the same result. Depending on the visibility timeout of the message (30 seconds by default), this can be a disaster. Looking at the dequeue count is key to discovering a poison message. For example, suppose you have the following exception handler:

catch (Exception e)
{
    if (msg != null && msg.DequeueCount > 2)
    {
        // poison message: discard it rather than let it recycle forever
        queue.DeleteMessage(msg);
    }
}
If an exception raised while processing the message isn't otherwise handled, we'll end up here – this is my outermost exception handler. In a workerrole, you _always_ want to make sure you have the opportunity to catch every exception, because an unhandled exception will cause the role to exit and recycle.
In this case, I check the dequeue count and simply delete the message if it's already been dequeued 3 or more times – an arbitrary threshold on my end, chosen because the queue's relatively long visibility timeout means a message would otherwise live for about an hour before being discarded. If we hit that number, I discard the message. Optionally, I could log it to an invalid-message table, put it in another queue, etc. The important thing is that we recognize a poison message and deal with it. With this particular queue, missing a message isn't that critical, so I can just delete it and move on.
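The control flow above is worth seeing end to end. Here's a minimal Python sketch of dequeue-count-based poison handling; the `Message` and `Queue` classes are hypothetical in-memory stand-ins for illustration, not the Azure StorageClient API, and the threshold of 3 matches the arbitrary number from the post:

```python
# Sketch of poison-message handling driven by a dequeue count, mirroring
# the C# handler in the post. Message/Queue are hypothetical stand-ins,
# NOT the Azure StorageClient API.

MAX_DEQUEUES = 3  # arbitrary threshold, as in the post

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0  # Azure exposes this as DequeueCount

class Queue:
    def __init__(self):
        self._messages = []

    def add(self, msg):
        self._messages.append(msg)

    def get(self):
        # A real queue would also hide the message for a visibility
        # timeout; here we just bump the count and hand it back.
        if not self._messages:
            return None
        msg = self._messages[0]
        msg.dequeue_count += 1
        return msg

    def delete(self, msg):
        self._messages.remove(msg)

def process_one(queue, handler):
    """Process a single message, discarding it if it looks poisonous."""
    msg = queue.get()
    if msg is None:
        return
    try:
        handler(msg)
        queue.delete(msg)  # processed successfully
    except Exception:
        if msg.dequeue_count >= MAX_DEQUEUES:
            # Poison message: delete it (or log it / move it to an
            # "invalid" queue) so it can't wedge the worker forever.
            queue.delete(msg)
        # otherwise leave it in the queue to be retried
```

A message whose handler always throws gets retried twice and is deleted on the third dequeue, instead of looping forever.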
I have a few posts planned along the lines of scaling an Azure application, but thought I'd throw this tidbit out. You might be developing your own management system for your Azure deployments, and in doing so, having visibility into the roles running within your deployment is very handy. The idea behind this series of posts is to explore how to get visibility into roles, and then a few ways we can tackle inter-role communication.

The first task is to define internal endpoints for each of our roles. If you don't do this, you will only have visibility into the current role. So if you have a single-role solution (with any number of instances), then you don't have to do this. Let's take a pretty simple example where we want to enumerate all the roles in our deployment, and their instances:

foreach (var appRole in RoleEnvironment.Roles)
{
    var instances = RoleEnvironment.Roles[appRole.Key].Instances;
    foreach (var instance in instances)
    {
        // display instance.Id along with the role name (appRole.Key)
    }
}
While this code will work fine, by default it will only show you the instances for the current role – other roles will show an instance count of zero. Running this in my workerrole and trying to enumerate my webroles, I see this:
What we can do is define an internal endpoint for our roles that will allow this data to be discoverable. In Visual Studio, it's simply a matter of going into the properties of the role and defining an internal endpoint:
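If you prefer editing the service model directly, the Visual Studio property page is just writing this into ServiceDefinition.csdef. A sketch of the relevant fragment is below; the role and endpoint names are placeholders, and the exact element nesting varies a little between SDK versions:

```xml
<WebRole name="WebRole1">
  <Endpoints>
    <InputEndpoint name="HttpIn" protocol="http" port="80" />
    <!-- The internal endpoint that makes this role's instances
         discoverable to other roles in the deployment -->
    <InternalEndpoint name="InternalEndpoint1" protocol="http" />
  </Endpoints>
</WebRole>
```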
And we’re done. Debugging the solution again:
We can do the same for our workerroles so they're discoverable by our webroles, and so on.
In some future posts, I’ll build off this idea for doing some lightweight service management and role communication.
If you've been playing with SQL Azure, you might have run into this error when opening a connection to the database:

A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An established connection was aborted by the software in your host machine.)

Ick! Anything at the transport level surely can't be a problem with my code, can it? :) The good news is that you aren't going crazy and the problem isn't with your code. There's a difference in the way SQL Azure manages its connections compared to .NET connection pooling, and the result is that the connection pool hands out connections that have already been closed. When you try to do something with that connection, you get this error. It's sporadic in that it only happens when you get an idle connection from the pool that has already been dropped by SQL Azure.

There are a couple of workarounds until a fix is implemented by Microsoft (I've been told it's coming soon). One method is to retry the connection (I hate this one, but it's a viable option nonetheless) – it's just messy, and I'd have to do it in a few dozen places. The amusing fix is to have a thread that continually pings the database, keeping every connection alive. The best fix I've found to date is to simply turn off connection pooling temporarily by adding a Pooling=false option to the connection string.

I tested this on my webrole, leaving my workerroles as-is, and the webrole has been running for a week or two without a single error, whereas the workerrole (without disabling pooling) still gets a couple of errors every day. I haven't done any performance tests, but UA testing (which is me) sees no appreciable hit, so I'll go with this option until the permanent fix is deployed.
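For reference, here's a sketch of what the Pooling=false change looks like in a web.config connection string. The connection string name, server, database, and credentials are all placeholders; only the Pooling keyword is the point:

```xml
<connectionStrings>
  <!-- Pooling=false works around the dropped-idle-connection issue;
       remove it once the SQL Azure fix is deployed, since disabling
       pooling makes every open a full new connection. -->
  <add name="MyDb"
       connectionString="Server=tcp:myserver.database.windows.net;Database=mydb;User ID=user@myserver;Password=...;Pooling=false;" />
</connectionStrings>
```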
I admit that site design often takes a back seat when developing Worldmaps, but in the recent migration to Windows Azure, I added a few minor customizations to the maps.

When you log into your account, you should see a list that contains all of your current maps: From this screen, you can either modify a map or create a new one. When creating a new map, you'll see a simple form to fill out: The only required information is the URL and Leaderboard. The latitude and longitude fields indicate your home location. You can use the home locator map at the bottom of the screen to help in this regard, or you can leave them blank – but without this information, some statistics cannot be calculated. The Leaderboard indicates which category best suits your map. Is it a personal blog? A technology blog? A personal site? Pick the one that best fits your site – this can be changed later. The last box, Invitation Code, is for premium accounts and changes the way data is stored for the given map. I described this briefly in my last post. For scalability reasons, most new accounts will use the new scheme – if you need more detailed information (such as the number of unique IPs), contact me for an invitation code.

Customizing your Map

Once your map is created, click the edit button next to your map to customize the colors. The form should look similar to: When the app draws your maps, you can control the colors used in drawing the circles. Feel free to experiment with some color choices. The four values you can customize are explained on the form. Use the Silverlight-based color picker to find a color and copy its value into the text box of the value you'd like to change, or enter a value directly into the box if you know the hex value of the color you'd like to use. When done, click Save. In general, the maps will be redrawn reasonably quickly, depending on how much work is currently in the queue.
Customizing the colors is a great way to add a little personalization to your maps!
The recent update to Windows Azure went quite well! The site is now using a single Azure webrole, a single Azure workerrole, Azure Queues for workload, and Azure Blobs for storage. It's also using SQL Azure as the database. From a user's point of view, not much has changed, but performance and scalability are much improved.

On the stats page, I implemented a few new stats. First up is the hourly breakdown of hits to a site. Below is Channel 9's current breakdown – a neat way to tell when traffic to your site is heaviest. In this case, C9 is busiest at 3pm GMT, or about 9am-4pm EST. In addition, Worldmaps now includes country breakdown information: And Stumbler has been updated a bit, so be sure to check it out and watch traffic in real time!

Finally, there's a change to the registration process. To add some scalability, Worldmaps now stores data in one of two schemes. The older scheme has been migrated to what is called a "plus" or enhanced account. The newer scheme is the default, and it stores data in a much more aggregated way. What determines how information is stored? This is based on an invitation code on the Create Map form: if no invitation code is provided, the newer scheme is used; if a valid invite code is provided, the older, more detailed method is used. If you'd like an invite code, drop me some feedback.

What's the difference? Currently, the difference is pretty small. On the stats page, the current number of unique IPs cannot be calculated, so it looks like so: Future report options are a bit limited as well, but otherwise all data (and Stumbler) is still available.
I have a fairly large Windows Azure migration I'm working on, and there are dozens of tips, recommendations, gotchas, etc. I've learned in the process. This is one small item that cost me quite a bit of time, and it's so simple that I'm detailing it here because someone else will run into it.

First, a bit of background: when you deploy a Windows Azure application, the package is uploaded and deployed as a whole. If you have dynamic content or individual files that are subject to change, it's a good idea to consider placing them in Azure Storage; otherwise you'll have to redeploy your entire application to update one of these files. In this case, I wanted to put a Silverlight XAP file in Azure Storage instead of the typical /ClientBin folder. There are a number of references on the net for setting this up – searching for "XAP" and "Azure" returns a lot of good results, including this one.

After checking my Silverlight project's app manifest file, ensuring everything was correct, and uploading the XAP to my Azure Storage account, my Silverlight app would refuse to run. The page would load, but then … nothing. I also checked Fiddler – the XAP file _was_ getting downloaded (and was downloading fine via a browser). This is typically a tell-tale sign of a policy issue – yet I was certain everything was correct. Here's a screenshot (click for larger) of Fiddler downloading the file. Can you spot the problem?

The problem was indirectly with Cerebrata's Cloud Storage Studio. Cerebrata's product is really nice, and I enjoy using it to work with Azure Storage (uploading/downloading files). CloudBerry's Explorer for Azure Blob Storage is another good one – I typically work with Cerebrata's simply because it was easier to connect to development storage (not sure if this is possible in CloudBerry's app). Fast-forward an hour or two of frustration. Staring at Cloud Storage Studio, I see: a Zip file as the content type?
This was in the Fiddler screenshot above, too. I figured this was fine because, after all, a XAP file is a zip file. But as it turns out, this was the problem. For kicks, I tried re-uploading the XAP file from CloudBerry to see how it handled the content type, and: Cloud Storage Studio does allow you to alter content types, but truthfully I didn't think it was all that important. When I reloaded my application, though, the app loaded and Fiddler came to life as it should: For kicks, I changed the content type back to Zip, and the app again failed to load. So, lesson learned! Don't forget about the content types.
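For what it's worth, the content type Silverlight expects for a XAP is application/x-silverlight-app. A minimal sketch of the kind of extension-to-content-type lookup you could run before uploading blobs; the helper and the table are mine, but the .xap mapping is the standard one:

```python
import os

# Content types by extension for blob uploads. The .xap entry is the one
# that matters here: Silverlight expects application/x-silverlight-app,
# while a generic "application/zip" is what broke the app above.
CONTENT_TYPES = {
    ".xap": "application/x-silverlight-app",
    ".xml": "text/xml",
    ".jpg": "image/jpeg",
}

def content_type_for(filename, default="application/octet-stream"):
    """Pick the content type to set on a blob, by file extension."""
    _, ext = os.path.splitext(filename.lower())
    return CONTENT_TYPES.get(ext, default)
```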
I just wanted to post a quick follow-up to yesterday's MSDN Event on Windows Azure. I talked briefly about interop with other platforms, specifically in regards to REST/SOAP. At the time, I wasn't aware that we have a few samples out there ... specifically:

http://www.dotnetservicesruby.com/
.NET Services for Ruby is an open source software development kit (SDK) that helps Ruby programs communicate with Microsoft .NET Services using plain HTTP. Specifically, the SDK includes a set of REST libraries, tools, prescriptive patterns & guidance & sample applications that will enhance productivity for Ruby developers. Developers will be able to leverage the .NET Services to extend their Ruby applications by using the Microsoft cloud services platform to build, deploy and manage reliable, Internet-scale applications.

http://www.jdotnetservices.com/
The purpose of this project is to provide an interoperable open source software development kit (SDK) - a set of libraries, tools, prescriptive patterns & guidance & real world sample applications that will enhance productivity for Java developers. Developers will be able to leverage the .NET Services to extend their Java applications by using the Microsoft cloud services platform to build, deploy and manage reliable, internet-scale applications.

In addition, there are a number of nice resources on the .NET Service Bus here. Specifically, the white papers offer much more depth on the security and access control services in Azure .NET Services. For those in the session (or reading this post) who would like more info – particularly on the enterprise or B2B scenarios – be sure to check out this resource.