SiteSpider: New Version

Hard to believe, but it has been over 3 years since I spit out SiteSpider. Not that it's any earth-shattering tool, but it really surprises me how fast time flies. The other day I was talking with a buddy about a couple of topics: using control invokes to update a form's controls on its main (UI) thread, and creating worker threads. Threading is top of mind for me right now, in part because our debugging tips and tricks session at the MSDN event covers debugging multithreaded applications.

I decided to brush the dust off of SiteSpider and do a little polishing. So, what is SiteSpider? First, a little history: years ago (I'm guessing the 2001-2002 time frame) I was walking past the office of some developers I worked with in another division. While chatting, I noticed they were running a tool that scanned their content (it was an online encyclopedia, actually) and generated a list of 404s, etc.

Now, while the tool they used was pretty feature-complete, I figured the basic premise would be a fun side project -- and that's SiteSpider. You point it at a URL, and it crawls the site, generates a link tree, and lets you look for broken links, slow pages, etc. It's a handy tool to run against your site a couple of times a year.

In this version, the biggest change is support for multiple threads. You can specify anywhere from 1 to 100 threads -- obviously, be cautious about going beyond 10 threads or so, since every extra thread adds load on the site being crawled, but multiple threads speed up the work considerably. The image below shows the new settings dialog:



The request delay applies to each individual thread, not to the crawl as a whole -- so with N threads, the site can see up to N requests per delay interval. In addition, the Max Page Size setting limits how much of a page is parsed, in case the spider hits a ridiculously large file.
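To make the interaction between these settings concrete, here is a minimal Python sketch (SiteSpider itself is not written in Python, and none of these names come from its source). The function names are hypothetical, and truncating the body to the limit is an assumption about how the max page size cap could work.

```python
def max_requests_per_second(threads: int, delay_seconds: float) -> float:
    """Upper bound on the request rate a site sees when every thread
    waits delay_seconds between its own requests (fetch time ignored)."""
    if threads < 1 or delay_seconds <= 0:
        raise ValueError("need at least one thread and a positive delay")
    return threads / delay_seconds


def body_for_parsing(body: bytes, max_page_size: int) -> bytes:
    """Assumed behavior: only the first max_page_size bytes are parsed."""
    return body[:max_page_size]


# One thread with a 0.5 s delay sends at most 2 requests/s;
# five threads with the same delay can send up to 10 requests/s.
single = max_requests_per_second(1, 0.5)
five = max_requests_per_second(5, 0.5)
```

In other words, if the delay was tuned for a single thread, it should be scaled up along with the thread count to keep the load on the site about the same.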

If you're running with the source code, you can see the threads in the threads window while broken into the debugger:



In the above, there are 5 worker threads (all parked on the Fetch method). The design of the app had to change pretty significantly to support multiple threads. This is largely because links within a site are circular (pages link back to each other), so the threads have to coordinate to avoid fetching the same page twice. To facilitate that, an additional thread called Work Coordinator (seen above) manages the worker threads.
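As a rough illustration of the coordinator idea, here is a small Python sketch. It is not SiteSpider's code: the `SITE` dictionary is a hypothetical in-memory stand-in for fetching and parsing pages, and the coordinator runs on the calling thread rather than on a thread of its own. The point is that only the coordinator touches the set of seen URLs, so circular links never get queued twice.

```python
import queue
import threading

# Hypothetical site: page -> links found on that page (note the cycles).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/c", "/a"],
    "/c": ["/"],
}


def worker(work: queue.Queue, results: queue.Queue) -> None:
    """Take URLs from the work queue until a None sentinel arrives."""
    while True:
        url = work.get()
        if url is None:
            return
        links = SITE.get(url, [])  # stand-in for an HTTP fetch plus parsing
        results.put((url, links))


def crawl(start: str, thread_count: int = 3) -> set:
    work: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(work, results))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()

    seen = {start}  # owned by the coordinator only, so no lock is needed
    work.put(start)
    pending = 1  # pages handed out but not yet reported back
    while pending:
        _, links = results.get()
        pending -= 1
        for link in links:
            if link not in seen:
                seen.add(link)
                work.put(link)
                pending += 1

    for _ in threads:  # one sentinel per worker shuts the pool down
        work.put(None)
    for t in threads:
        t.join()
    return seen
```

Calling `crawl("/")` visits each of the four pages exactly once, regardless of how many threads are used or how the links loop back.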

The other nice (and simple) change is the history:



... so for sites you crawl frequently, it's not just an empty text box. (Entries are listed most recently used first, and the history can be cleared in the settings dialog.)

Another item I included in this project is a test website.  I created this project primarily as a test bench so various conditions can be tested deterministically.  The main MVC-based site essentially creates an infinitely-deep tree that can be crawled.  Through parameters in the URL, you can specify the number of branches on each node, the current node, the current page on the current node, and the request delay.   For example, suppose you would like to create a tree where each node on the tree contains 6 branches.  If you'd like to start at the root of such a tree, with no page delay, the request would look like so:



In this case, notice we're just sending /6/1 in the URL. A /2/1 would be a binary tree. In the crawl above, I specified a crawl depth of 6. This was a fairly deep test and queried about 56,000 pages in a little over 3 minutes (localhost helped out a lot, I admit!).
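The page count is easy to sanity-check. The Python sketch below assumes that a crawl depth of 6 means the root plus six levels of links below it, and that every node contributes exactly one unique page with the given number of branches; the function name is made up for illustration.

```python
def pages_in_tree(branches: int, depth: int) -> int:
    """Number of pages in a full tree: 1 root plus branches**level
    pages at each level from 1 through depth."""
    return sum(branches ** level for level in range(depth + 1))


total = pages_in_tree(6, 6)  # 1 + 6 + 36 + 216 + 1296 + 7776 + 46656 = 55987
```

Under those assumptions the tree holds 55,987 pages, which lines up with the roughly 56,000 pages reported above.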

The project contains two other tests. One is a "bigpage" test that lets you exercise the max page size setting. The other is a test tree (similar in concept to the above, but not dynamic) that attempts to ferret out multithreading issues and depth issues. For example, page A has links to B and C, and B has links to C and D. Page C has a 2 second "working" delay. There are a few other circular references in the tree, and it's easily modified for testing purposes. As it is now, a successful test should look like:



So, that's all there is to it. The application _does_ save the response stream, so viewing a response is possible in principle; the viewer just isn't coded yet. Another known issue: robots.txt files are not honored. Ideally, this would be a configurable option.
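For what it's worth, here is one way a configurable robots check could look, sketched in Python with the standard library's `urllib.robotparser`. Nothing like this exists in SiteSpider today; the `build_robots_check` helper, the `honor_robots` flag, and the "SiteSpider" user agent string are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_check(robots_txt: str, user_agent: str = "SiteSpider"):
    """Return a function that says whether a URL may be crawled."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text that was already fetched

    def allowed(url: str, honor_robots: bool = True) -> bool:
        if not honor_robots:  # the hypothetical setting is switched off
            return True
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules: everything under /private/ is off limits to all crawlers.
allowed = build_robots_check("User-agent: *\nDisallow: /private/\n")
```

With these example rules, `allowed("http://localhost/private/page")` is false while `allowed("http://localhost/public")` is true, and passing `honor_robots=False` skips the check entirely.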

To download the executable, click here: 
SiteSpider_Binary.zip

To download the source and test project, click here:
SiteSpider_Source.zip
