Ip2Location (and IPinfoDB) Performance Tips

by Brian Hitney 9. March 2010 14:00

I’ve done a number of talks lately on Worldmaps and typically in side conversations/emails, people are curious about the databases and converting IP addresses to geographic locations.   And, often when you dive into using the data, it seems there are a number of performance considerations and I thought I’d share my input on these topics.

First up, the data.  Worldmaps uses two databases for IP resolution.  The primary/production database is Ip2Location.  I’ve found this database to be very accurate.  For development/demo purposes, I use IPinfoDB.  I haven’t had too much time to play with this database yet, but so far seems accurate also.   The latter is free, whereas Ip2Location is not.

In either case, the schema is nearly identical:

image

The BeginIp and EndIp columns are a clustered primary key.  In the case of IPinfoDB, there is no EndIp field (and it’s not really needed).  When performing a resolution, a string IP address is converted into a 64 bit integer and then used in searching the table.  That’s why having a clustered key on the BeginIp (and optionally EndIp) is crucial to performance.

But it doesn’t stop there.   The examples posted in the database’s respective home pages are accurate and simple, but need to be refactored for performance.  For example, to do a simple resolution on Ip2Location, according to their docs:

SELECT * FROM dbo.Ip2Location WHERE @IpNum BETWEEN BeginIp and EndIp

And for IPInfoDB:

SELECT TOP 1 * FROM IPInfoDB where BeginIp <= @IpNum ORDER BY BeginIp DESC

Both of these methods are perfectly fine, particularly for use as generic samples.  The second one is on the right track, but it doesn’t work for joins so if you’re querying over a range, you’d need to refactor.  And in the first example, using a BETWEEN operator forces a clustered index scan when joining, killing the performance.

If we run the first example across my minified Ip2LocationSmall table, we’ll see something like this (and this is running off of SQL Azure – the perf is pretty great compared to localhost!):

image

We can also look at the time:

image

Ouch!  Now, it doesn’t seem too bad, but imagine doing thousands of these requests per minute, or doing large joins. 

The goal then is to provide some hints that will optimize the query, particularly for joins.  Our indexes are correct, so we can rework the query to get rid of the BETWEEN operator – we can sacrifice a little readability and do something like:

SELECT *   
FROM (
    select
        ( select MAX(beginip)
          from dbo.Ip2LocationSmall
          where BeginIp <= @IpNum
        ) as IP_Begin           
    ) as foo
INNER JOIN dbo.Ip2LocationSmall iploc
ON iploc.BeginIp = foo.IP_Begin

The result:

image

And the time shows some improvement:

image

But the REAL benefit comes when we need to join.   Suppose I’d like to get a list of the countries for a given map (which is a parameter called MapId):

SELECT DISTINCT(ip.CountryCode)
FROM MapHits hits
INNER JOIN dbo.Ip2LocationSmall ip
    ON hits.IpNum BETWEEN ip.BeginIp AND ip.EndIp
WHERE MapId = @MapId

The query returns 95 rows, and executes in 16 seconds:

image

image

In this case, we can refactor this using the method above to something like:

SELECT
    DISTINCT(CountryCode)   
FROM (
     select IpNum,
        ( select MAX(beginip)
          from Ip2LocationSmall
          where BeginIp <= IpNum
        ) as IP_Begin
      from dbo.MapHits as hits
      where MapId = @MapId
  ) as foo
INNER JOIN Ip2LocationSmall iploc
ON iploc.BeginIp = foo.IP_Begin

Again, not as pretty looking, but boy what a difference:

image

image

We went from 16,500 milliseconds to 260 – over 60x the performance!  Mike @AngryPets would be proud.   The reason for the perf gain is we were able to eliminate the nested loop, which is (in this case) scanning the entire clustered index for the matching rows.

The second benefit is the ability to switch schemas easily between IP2Location and IPinfoDB, and we can additionally lose the EndIp column which trims the size of the table.

Tags: , , ,

SQL Azure | SQL | Tech Tips | Development

SQL Azure Logins

by Brian Hitney 13. February 2010 05:00

SQL Azure currently has fairly limited management capabilities.  When you create a database, you receive an administrator account that is tied to your login (you can change the SQL Azure password, though).  Because there is no GUI for user management, there’s a temptation to use this account in all your applications, but I highly recommend you create users for your application that have limited access. 

If you limit access to only stored procedures, you need to specify execute permissions.  Assuming you want your connection to have execute permissions on all stored procedures, I recommend a new role that has execute permissions.  That way, you can simply add users to this role and as you add more stored procedures, it simply works.  To create this role, you can do something like this:

CREATE ROLE db_executor
GRANT EXECUTE TO db_executor

Now in the master database (currently, you need to do this in a separate connection – just saying ‘use master’ won’t work) you can create your login for the database:

CREATE LOGIN MyUserName 
    WITH PASSWORD = 'Password';

In your application database, you need to create a user – in this case, we’ll just create a user with the same name as the login:

CREATE USER MyUserName FOR LOGIN MyUserName;

Next, we’ll specify the appropriate roles.  Depending on your needs, you may need only datareader.  I recommend db_owner only if necessary.

-- read/write/execute permissions
EXEC sp_addrolemember N'db_datareader', N'MyUserName'
EXEC sp_addrolemember N'db_datawriter', N'MyUserName'
EXEC sp_addrolemember N'db_executor', N'MyUserName'

-- only if you need dbo access:
EXEC sp_addrolemember N'db_owner', N'MyUserName'

You can continue to customize as necessary, as long as you are familiar with the appropriate T-SQL. 

Tags: , ,

Azure | SQL | SQL Azure

SQL Azure Transport-Level Error

by Brian Hitney 3. February 2010 10:34

If you’ve been playing with SQL Azure, you might have run into this error when opening a connection to the database:

A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An established connection was aborted by the software in your host machine.)

Ick!  Anything at that transport level surely can’t be a problem with my code, can it?  :)

The goods news is that you aren’t going crazy and the problem isn’t with your code.  There’s a difference in the way SQL Azure manages its connections compared to .NET connection pooling, and the result is that the connection pool is giving out connections that have already been closed.  When you try to do something with that connection, you get this error.   It’s sporadic in that only happens when you get an idle connection from the pool that has already been dropped by SQL Azure.

There are a couple of workarounds until a fix is implemented by Microsoft (I’ve been told it’s coming soon).  One method is to retry the connection (I hate this one, but it’s a viable option nonetheless).  It’s just messy and I’d have to do this in a few dozen places.  The amusing fix is to have a thread that continually pings the database, keeping every connection alive.  The best fix that I’ve found to date is to simply turn off connection pooling temporarily by adding a pooling=false option to the connection string:

image

I tested this on my webrole, leaving my workerroles as-is, and the webrole has been running for a week or two without a single error, whereas the workerrole (without disabling pooling) will get a couple errors every day.

I haven’t done any performance tests but UA testing (which is me) sees no appreciable hit, so I’ll go with this option until the permanent fix is deployed. 

Tags: , , ,

Azure | Development | SQL | Tech Tips

Leaderboard Tweaks and T-SQL Fun

by Brian Hitney 21. January 2010 06:01

With the chaos of the Worldmaps to Azure migration winding down, I decided to take a look at some the low-pri issues on my to-do list.   Those close to the project or who follow their leaderboard ranking closely know that last year, the leaderboard was ranked by Total Hits.  This worked pretty well at the time, but it made it impossible for those who joined the site weeks or months after others to compete unless their site had considerable more volume to close the gap.

This year, I changed the formula to base the data off of average hits per day.  This adds a bit more excitement because it’s a level playfield regardless of join date, for the most part.  The one remaining problem I have with hits per day is that if a site gains popularity over time, the resulting average encourages the user to delete and recreate their map for better rankings.  At some point in time, I’ll further refine this so it looks at hits per day for the last month or two (for example).

In the meantime, though, I got some heat from Chris Eargle who pointed out some problems with my implementation when he looked at his numbers.  The way the implementation worked was that it grabbed the earliest hit, the most recent hit, and divided by the number of days.   To prevent division by zero and give a minimum of 1 day, 1 was added to the number of days like so:

coalesce(sum(hits.NumHits) / (DATEDIFF(d, MIN(hits.CreateDate), MAX(hits.ModifyDate))+1) ,0) as HitsPerDay

While this worked “okay,” it wasn’t perfect.  For starters, using DATEDIFF returns an integer based on the dates passed in.  If you pass in “1/21/2010 23:59” and “1/22/2010 00:01,” the DATEDIFF (by days) is 1, even though only 2 minutes have passed.  If you pass in “1/22/2010 00:01” and “1/22/2010 23:59,” a zero is returned even though it’s just 2 minutes short of a day.  This isn’t wrong, but it’s limited for what we need because the leaderboard is generated throughout the day. 

Adding 1 can be incorrect because, using the first example, it will now return 2 days (!) when only 2 minutes have expired.  It’s a pretty extreme example but possible.  I wanted an implementation that, using T-SQL alone, would be a bit more consistent.  The criteria I had was: 1) had to support partial days and 2) had to give a minimum of 1 day.  #2 is because I don’t want a map that was just created (as in the first time example) to end up getting (again, for example) 50 hits in 2 minutes and due to the math, getting averaged at say 100,000 hits per day.  Using a minimum of 1 day makes the leaderboard a little more stable.  The resulting query:

CAST(FLOOR(
    SUM(NumHits)/
      CASE 
        WHEN (DATEDIFF(hh, MIN(hits.CreateDate), MAX(hits.ModifyDate))/cast(24 as float)) < 1 THEN 1        
        ELSE (DATEDIFF(hh, MIN(hits.CreateDate), MAX(hits.ModifyDate))/cast(24 as float))
      END
      )
as int) as HitsPerDay

This works out really well.  First, we use hours instead of days and simply divide by 24.  It’s essentially the same as days, but gives a fractional value.  If the value is less than 1, we’ll use just 1 as the minimum.  This solves the second criteria above and also any division by zero cases.  The ELSE of the CASE gives us the fractional days, so hits per day is more accurate and no more +1 tomfoolery.  Even though the FLOOR returns an whole number, the data type is unchanged so a cast to an int is necessary for the application.

If we look at the results in query analyzer, you can see the results are more true:

image

The “days” and “days_partial” shows the difference between using hours and days in DATEDIFF, giving us more accuracy.  The “HitsPerDay_New” compared to “HitsPerDay_Original” shows the more accurate result.  In some cases unchanged, in some cases much more accurate as you can see by looking at the “FirstHit” and “MostRecentHit” columns.

Anyway, just some fun playing with T-SQL and if you see some adjustments in the Leaderboard, this is why. 

But Chris, it won’t help you get into the Top 10. :)

Tags: , ,

Worldmaps | Development | SQL

your host...

Brian Hitney
Developer Evangelist
Microsoft Corp.

About Me

My Worldmap