Storing Data in Azure: SQL, Tables, or Blobs?

by Brian Hitney 28. March 2011 10:01

While building the back end to host our “Rock, Paper, Scissors in the cloud” game, we faced a situation of where/how to store the log files for the games that are played.   In my last post, I explained a bit about the idea; in the game, log files are essential at tuning your bot to play effectively.  Just to give a quick example of what the top of a log file might look like: 

image

In this match, I (bhitney) was playing a house team (HouseTeam4) … each match is made up of potentially thousands of games, with one game per line.    From the game’s perspective, we only care about the outcome of the entire match, not the individual games within the match – but we need to store the log for the user. 

There’s no right or wrong answer for storing data – but like everything else, understanding the pros and cons is the key. 

Azure Tables

We immediately ruled out Azure Tables, simply because the entity size is too big.   But what if we stored each game (each line of the log) in an Azure Table?    After all, Azure Tables shine at large, unstructured data.   This would be ideal because we could ask specific questions of the data – such as, “show me all games where…”.  Additionally, size is really not a problem we’d face – tables can scale to TBs. 

But, storing individual games isn’t a realistic option.  The number of matches played for a 100 player match 4,950.  Each match has around 2,000 games, so that means we’d be looking at 9,900,000 rows per round.   At a few hundred milliseconds per insert, it would take almost a month to insert that kind of info.  Even if we could get latency to a blazing 10ms, it would still take over a day to insert that amount of data.    Cost wise, it wouldn’t be too bad: about $10 per round for the transaction costs.

Blob Storage

Blob storage is a good choice as a file repository.  Latency-wise, we’d still be looking at 15 minutes per round.  We almost went this route, but since we’re using SQL Azure anyway for players/bots, it seemed excessive to insert metadata into SQL Azure and then the log files into Blob Storage.  If we were playing with tens of thousands of people, that kind of scalability would be really important.   But what about Azure Drives?   We ruled drives out because we wanted the flexibility of multiple concurrent writers. 

SQL Azure

Storing binary data in a database (even if that binary data is a text file) typically falls under the “guilty until proven innocent” rule.  Meaning: assume it’s a bad idea.  Still, though, this is the option we decided to pursue.  By using gzip compression on the text, the resulting binary was quite small and didn’t add significant overhead to the original query used to insert the match results to begin with.  Additionally, the connection pooling makes those base inserts incredibly fast – much, much faster that blob/table storage.

One other side benefit to this approach is that we can serve the GZip stream without decompressing it.  This saves processing power on the web server, and also takes a 100-200k log file to typically less than 10k, saving a great deal of latency and bandwidth costs.

Here’s a simple way to take some text (in our case, the log file) and get a byte array of the compressed data.  This can then be inserted into a varbinary(max) (or deprecated image column) in a SQL database:

   1: public static byte[] Compress(string text)
   2: {
   3:    byte[] data = Encoding.UTF8.GetBytes(text);
   4:    var stream = new MemoryStream();
   5:    using (Stream ds = new GZipStream(stream, CompressionMode.Compress))
   6:    {
   7:        ds.Write(data, 0, data.Length);
   8:    }
   9:  
  10:    byte[] compressed = stream.ToArray();
  11:  
  12:    return compressed;
  13: }

And to get that string back:

   1: public static string Decompress(byte[] compressedText)
   2: {
   3:     try
   4:     {
   5:         if (compressedText.Length == 0)
   6:         {
   7:             return string.Empty;
   8:         }
   9:  
  10:         using (MemoryStream ms = new MemoryStream())
  11:         {
  12:             int msgLength = BitConverter.ToInt32(compressedText, 0);
  13:             ms.Write(compressedText, 0, compressedText.Length - 0);
  14:  
  15:             byte[] buffer = new byte[msgLength];
  16:  
  17:             ms.Position = 0;
  18:             using (GZipStream zip = new GZipStream(ms, CompressionMode.Decompress))
  19:             {
  20:                 zip.Read(buffer, 0, buffer.Length);
  21:             }
  22:  
  23:             return Encoding.UTF8.GetString(buffer);
  24:         }
  25:     }
  26:     catch
  27:     {
  28:         return string.Empty;
  29:     }
  30: }
 
In our case, though, we don’t really need to decompress the log file because we can let the client browser do that!  In our case, we have an Http Handler that will do that, and quite simply it looks like:
 
   1: context.Response.AddHeader("Content-Encoding", "gzip");
   2: context.Response.ContentType = "text/plain";
   3: context.Response.BinaryWrite(data.LogFileRaw); // the byte array
   4: context.Response.End();

Naturally, the downside of this approach is that if a browser doesn’t accept GZip encoding, we don’t handle that gracefully.   Fortunately it’s not 1993 anymore, so that’s not a major concern.

Tags: , , , ,

Azure | Development | USCloud | SQL Azure

Ip2Location (and IPinfoDB) Performance Tips

by Brian Hitney 9. March 2010 19:00

I’ve done a number of talks lately on Worldmaps and typically in side conversations/emails, people are curious about the databases and converting IP addresses to geographic locations.   And, often when you dive into using the data, it seems there are a number of performance considerations and I thought I’d share my input on these topics.

First up, the data.  Worldmaps uses two databases for IP resolution.  The primary/production database is Ip2Location.  I’ve found this database to be very accurate.  For development/demo purposes, I use IPinfoDB.  I haven’t had too much time to play with this database yet, but so far seems accurate also.   The latter is free, whereas Ip2Location is not.

In either case, the schema is nearly identical:

image

The BeginIp and EndIp columns are a clustered primary key.  In the case of IPinfoDB, there is no EndIp field (and it’s not really needed).  When performing a resolution, a string IP address is converted into a 64 bit integer and then used in searching the table.  That’s why having a clustered key on the BeginIp (and optionally EndIp) is crucial to performance.

But it doesn’t stop there.   The examples posted in the database’s respective home pages are accurate and simple, but need to be refactored for performance.  For example, to do a simple resolution on Ip2Location, according to their docs:

SELECT * FROM dbo.Ip2Location WHERE @IpNum BETWEEN BeginIp and EndIp

And for IPInfoDB:

SELECT TOP 1 * FROM IPInfoDB where BeginIp <= @IpNum ORDER BY BeginIp DESC

Both of these methods are perfectly fine, particularly for use as generic samples.  The second one is on the right track, but it doesn’t work for joins so if you’re querying over a range, you’d need to refactor.  And in the first example, using a BETWEEN operator forces a clustered index scan when joining, killing the performance.

If we run the first example across my minified Ip2LocationSmall table, we’ll see something like this (and this is running off of SQL Azure – the perf is pretty great compared to localhost!):

image

We can also look at the time:

image

Ouch!  Now, it doesn’t seem too bad, but imagine doing thousands of these requests per minute, or doing large joins. 

The goal then is to provide some hints that will optimize the query, particularly for joins.  Our indexes are correct, so we can rework the query to get rid of the BETWEEN operator – we can sacrifice a little readability and do something like:

SELECT *   
FROM (
    select
        ( select MAX(beginip)
          from dbo.Ip2LocationSmall
          where BeginIp <= @IpNum
        ) as IP_Begin           
    ) as foo
INNER JOIN dbo.Ip2LocationSmall iploc
ON iploc.BeginIp = foo.IP_Begin

The result:

image

And the time shows some improvement:

image

But the REAL benefit comes when we need to join.   Suppose I’d like to get a list of the countries for a given map (which is a parameter called MapId):

SELECT DISTINCT(ip.CountryCode)
FROM MapHits hits
INNER JOIN dbo.Ip2LocationSmall ip
    ON hits.IpNum BETWEEN ip.BeginIp AND ip.EndIp
WHERE MapId = @MapId

The query returns 95 rows, and executes in 16 seconds:

image

image

In this case, we can refactor this using the method above to something like:

SELECT
    DISTINCT(CountryCode)   
FROM (
     select IpNum,
        ( select MAX(beginip)
          from Ip2LocationSmall
          where BeginIp <= IpNum
        ) as IP_Begin
      from dbo.MapHits as hits
      where MapId = @MapId
  ) as foo
INNER JOIN Ip2LocationSmall iploc
ON iploc.BeginIp = foo.IP_Begin

Again, not as pretty looking, but boy what a difference:

image

image

We went from 16,500 milliseconds to 260 – over 60x the performance!  Mike @AngryPets would be proud.   The reason for the perf gain is we were able to eliminate the nested loop, which is (in this case) scanning the entire clustered index for the matching rows.

The second benefit is the ability to switch schemas easily between IP2Location and IPinfoDB, and we can additionally lose the EndIp column which trims the size of the table.

Tags: , , ,

SQL Azure | SQL | Tech Tips | Development

SQL Azure Logins

by Brian Hitney 13. February 2010 10:00

SQL Azure currently has fairly limited management capabilities.  When you create a database, you receive an administrator account that is tied to your login (you can change the SQL Azure password, though).  Because there is no GUI for user management, there’s a temptation to use this account in all your applications, but I highly recommend you create users for your application that have limited access. 

If you limit access to only stored procedures, you need to specify execute permissions.  Assuming you want your connection to have execute permissions on all stored procedures, I recommend a new role that has execute permissions.  That way, you can simply add users to this role and as you add more stored procedures, it simply works.  To create this role, you can do something like this:

CREATE ROLE db_executor
GRANT EXECUTE TO db_executor

Now in the master database (currently, you need to do this in a separate connection – just saying ‘use master’ won’t work) you can create your login for the database:

CREATE LOGIN MyUserName 
    WITH PASSWORD = 'Password';

In your application database, you need to create a user – in this case, we’ll just create a user with the same name as the login:

CREATE USER MyUserName FOR LOGIN MyUserName;

Next, we’ll specify the appropriate roles.  Depending on your needs, you may need only datareader.  I recommend db_owner only if necessary.

-- read/write/execute permissions
EXEC sp_addrolemember N'db_datareader', N'MyUserName'
EXEC sp_addrolemember N'db_datawriter', N'MyUserName'
EXEC sp_addrolemember N'db_executor', N'MyUserName'

-- only if you need dbo access:
EXEC sp_addrolemember N'db_owner', N'MyUserName'

You can continue to customize as necessary, as long as you are familiar with the appropriate T-SQL. 

Tags: , ,

Azure | SQL | SQL Azure

your host...

Brian Hitney
Developer Evangelist
Microsoft Corp.

About Me

My Worldmap