In part 1 (read it here), I discussed the
prerequisite data that we needed to act as a foundation for developing
geographic-based reports. In this part, I'll discuss how we map those IP addresses
to geographic locations, and finally, in part 3, we'll plot the data on a map.
When it comes to figuring out where a user is located, there are only two pieces of information we need: latitude and longitude. Other data, like country or currency, may be useful for other reports, but all of that can be derived from the latitude and longitude. It's not surprising that this is difficult: IP addresses are not assigned by geographic location (though IPs in the same subnet are frequently geographically close, and we can use that to our advantage), so we can't simply read the location off the request like we do with other information such as the user agent. And even if we knew where an IP was located, how do we handle proxy servers or NAT?
The answer is: we don't. Proxies and NAT are, generally speaking, not transparent (meaning we can't see behind them). In some rare instances, a proxy may forward an X-Forwarded-For header (exposed to server code as HTTP_X_FORWARDED_FOR), but most proxy servers don't (and some even put random IP addresses in there). If I'm on the east coast and VPN into my home network (or use a proxy server) on the west coast, and then visit some websites, it will look like I'm visiting from the west coast.
So there's a certain amount of inaccuracy we'll have to live with. It's the way of the internet: perfect, 100% accuracy is simply not possible. Geolocation of IP addresses will be particularly inaccurate with mobile or satellite-based systems.
There are a few possible approaches to resolving the data. One way is by data mining your visitors. For example, based on IP address alone, we can look up the IP owner through a WHOIS search at one of the regional registries like ARIN or RIPE. If your site requires registration, you can cross-reference that registration information with the WHOIS lookup and likely make a good guess. Of course, users may be visiting from work, an internet cafe, or a friend's house, so you'd need to evaluate IPs on a case-by-case basis. But with a little logic, time, and volume, your guesses would likely be reasonably accurate. As I pointed out in my original post, each pixel on a world map that's 720 pixels wide is roughly 30 miles across -- not exactly requiring laser precision.
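The back-of-the-envelope arithmetic behind that per-pixel figure can be sketched as follows. This assumes a simple equirectangular world map (360 degrees of longitude spread evenly across the width) and an equatorial circumference of roughly 24,901 miles. The class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper for estimating how much ground one map pixel covers.
public static class MapResolution
{
    // Approximate equatorial circumference of the Earth, in miles (assumption).
    public const double EquatorMiles = 24901.0;

    // Miles covered by one pixel, measured along the equator.
    public static double MilesPerPixel(int mapWidthPixels)
    {
        if (mapWidthPixels <= 0)
            throw new ArgumentOutOfRangeException("mapWidthPixels");
        return EquatorMiles / mapWidthPixels;
    }

    // Degrees of longitude covered by one pixel.
    public static double DegreesPerPixel(int mapWidthPixels)
    {
        if (mapWidthPixels <= 0)
            throw new ArgumentOutOfRangeException("mapWidthPixels");
        return 360.0 / mapWidthPixels;
    }
}
```

For a 720-pixel map, MilesPerPixel(720) works out to about 34.6 miles at the equator and DegreesPerPixel(720) to half a degree, so a lookup that lands in the wrong suburb usually still lands on the right pixel.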
Of course, we don't want to spend the time to investigate each IP address ourselves. Fortunately, a LOT of people want this information, and thanks to e-commerce, sites often sell your information and, because of the demand, it's pretty easy to get. Like credit bureaus, if a company can collect enough information from enough sources, the resolution and accuracy begin to ramp up. My IP resolves as Bellevue, WA -- not exactly correct, but close enough, and technically pixel-accurate on my 720-pixel map.
In fact, IP resolution works very much like a credit bureau -- individuals are scored based on their individual history. If there is no history, a person can thank their neighbors and the number of liquor stores in their neighborhood for their credit score, because that's what the bureaus use to determine creditworthiness. It's profiling, and research shows it's accurate. Likewise, these providers combine and cross-reference data from multiple sources to profile someone, even if they've never surfed the internet before.
So let's do what everyone else does and buy the information and, to a certain degree, get it for free. In my quest, I found a few companies. The first is CDYNE. They offer a web service called IP2Geo that accepts an IP address and returns a bunch of really cool information. Want to give it a try? Head to their asmx test page here and see if it works for you. While I liked the convenience of a web service, the accuracy was really off (up to half the globe off) on a few of my test cases (for example, one IP that I know is in Seattle resolves as being in Tel Aviv). Some feedback I saw online suggested the same thing.
The second I looked at was IP2Location.
Unfortunately, their database offering that included latitude and longitude was
a bit pricey, so I needed to pass. However, while searching for
information on this company, I noticed that another company called FraudLabs offers the IP2Location
database over a web service (see their site here).
You can sign up for a trial key and then subscribe at reasonable rates (about 4 cents per resolution, depending on volume).
The third provider I looked at (and
ultimately selected) was GeoBytes.
There's a server-side solution you can purchase (like IP2Location), but I liked
the flexibility of "MapBytes" -- a pay-per-resolution model they offer. Of the three, I felt GeoBytes' website wasn't quite as professional or clean, but they nailed my test cases and responded to two questions almost immediately (one on a Saturday and the other on a Sunday), and there's a support forum to boot.
Plus, there's another incentive: they offer 20 lookups per hour for free (and I don't know about you, but I rarely get more than 20 new and unique IPs per hour ... for that matter, I rarely get more than 20 requests per hour!).
It would be ideal if they offered a web service, but instead the data is retrieved with a raw HTTP GET. The response is formatted according to a template file that you specify (I use the XML template). I then parse the XML and store the result. I purchased a set of MapBytes that essentially let me go above the 20/hour limit and used them to seed the bulk of my data (each resolution costs about 1/2 cent), leaving the rest to resolve over time within the free 20/hour limit.
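Staying under the free limit is easy to automate. Here's a minimal sketch of one way to do it, assuming a sliding one-hour window (how GeoBytes actually counts its hour isn't covered here, so treat the window as an assumption). LookupThrottle is a made-up helper, not something GeoBytes provides, and it isn't thread-safe as written.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: allows at most `limit` lookups in any sliding `window`.
public class LookupThrottle
{
    private readonly int limit;
    private readonly TimeSpan window;
    private readonly Queue<DateTime> stamps = new Queue<DateTime>();

    public LookupThrottle(int limit, TimeSpan window)
    {
        this.limit = limit;
        this.window = window;
    }

    // Returns true (and records the lookup) if we are under the limit at `now`.
    public bool TryAcquire(DateTime now)
    {
        // Forget lookups that have aged out of the window.
        while (stamps.Count > 0 && now - stamps.Peek() >= window)
        {
            stamps.Dequeue();
        }

        if (stamps.Count >= limit)
        {
            return false; // over the limit; leave this IP for a later pass
        }

        stamps.Enqueue(now);
        return true;
    }
}
```

In practice you'd create one instance with new LookupThrottle(20, TimeSpan.FromHours(1)), call TryAcquire(DateTime.UtcNow) before each request, and leave any unresolved IPs queued for the next pass when it returns false.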
Using MapBytes with a programmatic HTTP GET is a bit complicated. The request requires a session token in the URL, which means you have to log in to get the token. While this can probably be accomplished programmatically, it's a pain and requires parsing the response headers for the token. This step isn't needed for the free 20/hour lookups -- only going above this number requires the access token, and it expires after 90 minutes of inactivity. GeoBytes told me they're working on a web service, so hopefully they'll release that soon and resolutions will be much easier. Their forums have a lot of information to help with this process. So far, though, I can highly recommend MapBytes. In fact, to put it to the test, I'm going to have a "Show Me" page that will try to guess your location, and I'll put a little survey on there to see how accurate it is.
Once you have the latitude and longitude and can map it to the data we gathered in part 1, you're almost home. If you're looking to get off the ground with GeoBytes, here's some test C# code to make the request. It will make a request to any URL and return a string containing the data received (note, however, that I'm assuming the ISO-8859-1 encoding that GeoBytes uses).
// Requires: using System; using System.IO; using System.Net; using System.Text;
public string FetchAddress(string address)
{
    StringBuilder sb = new StringBuilder();
    HttpWebRequest httpReq;
    HttpWebResponse httpRsp = null;

    try
    {
        httpReq = (HttpWebRequest)WebRequest.Create(address);
        httpReq.AllowAutoRedirect = true;
        httpReq.Timeout = 30000; // milliseconds

        httpRsp = (HttpWebResponse)httpReq.GetResponse();

        // ISO-8859-1 is a single-byte encoding, so decoding chunk by chunk is safe
        Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
        byte[] buf = new byte[1024];
        Stream resStream = httpRsp.GetResponseStream();
        int count = 0;

        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // append whatever was read to our string
            if (count != 0)
            {
                sb.Append(encoding.GetString(buf, 0, count));
            }
        }
        while (count > 0);

        return sb.ToString();
    }
    catch (Exception)
    {
        // do whatever handling is appropriate, or just...
        return null;
    }
    finally
    {
        // closing the response also closes the response stream
        if (httpRsp != null)
        {
            httpRsp.Close();
        }
    }
}
Getting the results from GeoBytes can be done by simply calling FetchAddress (where the string IpAddress is the IP Address to query) like this:
string IpAddress = "1.1.1.1";
string Url = string.Format(
    "http://www.geobytes.com/IpLocator.htm?GetLocation&template=xml.txt&IpAddress={0}",
    IpAddress
);
string result = FetchAddress(Url);

// FetchAddress returns null if the request failed
if (result != null)
{
    // oops, not all data coming back is encoded...
    result = result.Replace("&", "&amp;");
}
You'll note the workaround on the last line for a minor encoding issue on their end: in one case, the comment field in their dataset contained an unencoded ampersand, which of course raised an exception when I loaded the response as XML. It only happened in one out of several thousand lookups, they're correcting the issue, and this hack is just a temporary stopgap.
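One caveat with a blanket replace is that it would also double-encode any ampersands that are already part of a proper entity. If that worries you, a slightly more careful variation is to escape only the bare ones. This is a rough sketch, not anything GeoBytes recommends; XmlCleanup and EscapeBareAmpersands are names I've made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: escapes only ampersands that don't already start an entity.
public static class XmlCleanup
{
    // Matches "&" unless it begins something shaped like &name; &#123; or &#x1F;
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        if (xml == null)
        {
            return null;
        }
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

Note that this only checks the shape of the entity, so a named entity that XML doesn't define (like an HTML-only one) would still be left alone and could still trip up the parser.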
Cut and paste the URL in the code above into your browser to give it a try (put an IP address in the {0} at the end, of course). The XML you get back should look something like this:
<?xml version="1.0" encoding="iso-8859-1" ?>
<info>
<IP>141.150.42.139</IP>
<countryid>119</countryid>
<country>Italy</country>
<fips>IT</fips>
<iso2>IT</iso2>
<iso3>ITA</iso3>
<ison>380</ison>
<internet>IT</internet>
<comment></comment>
<regionid>2246</regionid>
<region>Lombardia</region>
<code>LO</code>
<adm>IT09</adm>
<cityid>11937</cityid>
<city>Milano</city>
<latitude>45.4670</latitude>
<longitude>9.2000</longitude>
<timezone>+01:00</timezone>
<dmaid></dmaid>
<dma></dma>
<market></market>
<certainty>50</certainty>
<locationcode>ITLOMILA</locationcode>
<ipaddress>141.150.42.139</ipaddress>
</info>
Parsing the XML is fairly trivial. For a description of each field, visit GeoBytes' tag information page here.
Note the certainty field: it's provided by the vendor to indicate how conclusive the lookup is, and a higher number, of course, means there's a greater chance of the data being accurate. (Some of the data in the example above is fictional.) In some cases, though, the results will be empty. When I investigated these, they were almost always some type of spider or bot that slipped past ReverseDOS and other log filtering.
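To give a rough idea of what the parsing might look like, here's a minimal sketch that pulls out just the latitude, longitude, and certainty fields from the sample above and returns null for the empty results. GeoResult and GeoParser are hypothetical names, and I'm assuming .NET 3.5 or later for System.Xml.Linq.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical container for the handful of fields we care about.
public class GeoResult
{
    public double Latitude;
    public double Longitude;
    public int Certainty;
}

public static class GeoParser
{
    // Returns null when the response has no usable coordinates
    // (for example, the empty results described above).
    // Malformed XML still throws an XmlException from XDocument.Parse.
    public static GeoResult Parse(string xml)
    {
        if (string.IsNullOrEmpty(xml))
        {
            return null;
        }

        XElement info = XDocument.Parse(xml).Root;
        if (info == null)
        {
            return null;
        }

        // Invariant culture so "45.4670" parses the same on any machine.
        double lat, lon;
        if (!double.TryParse((string)info.Element("latitude"),
                NumberStyles.Float, CultureInfo.InvariantCulture, out lat))
        {
            return null;
        }
        if (!double.TryParse((string)info.Element("longitude"),
                NumberStyles.Float, CultureInfo.InvariantCulture, out lon))
        {
            return null;
        }

        // A missing or empty certainty is treated as 0.
        int certainty;
        int.TryParse((string)info.Element("certainty"),
            NumberStyles.Integer, CultureInfo.InvariantCulture, out certainty);

        return new GeoResult { Latitude = lat, Longitude = lon, Certainty = certainty };
    }
}
```

Returning null for empty lookups makes it easy to skip those rows (or flag the IP as a probable bot) rather than plotting everything at 0,0.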
If you're going to pursue geolocation, GeoBytes' website has a lot of great info on the technology and process, and what you can expect from an accuracy point of view.
To recap, we now know which IPs visited the site, the date/time and pages (if you've chosen to capture that info), and we have a likely location. With this step complete, we can move on to part 3, plotting that info on a map!
