.NET Expecting the Unexpected

As some of my friends know, I decided to write my own blogging engine way back when ... I figured it would be a fun way to immerse myself in RSS.  I don't regret that decision, but it turned out to be a lot more work than I thought.

In this weekend's update, I decided my trackback functionality needed a facelift.  I never spent much time working on it ... and it's quite a bit more complicated to implement than, say, a comments engine.  You can read the spec here, but in short, trackbacks are easiest to describe as "remote comments": letting other blog owners (or, potentially, news-related sites) know that you're commenting on a particular article or post.

So the broad steps to implement a trackback engine are:

1. The Auto-Discovery Phase: you want to be able to send trackbacks automatically based on the links within your post.  This one is pretty straightforward: a regex to pull out all the links.
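Assuming the post body is HTML, the link extraction can be sketched like this (the class and method names are my own, and a regex of this sort is a heuristic, not a full HTML parse):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Pull the href values out of anchor tags in the post body.
    // We only need candidate URLs to probe for trackback support,
    // so a forgiving regex is good enough here.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        var pattern = new Regex(
            @"<a[^>]+href\s*=\s*[""']?([^""'\s>]+)",
            RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }
}
```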

2. The Search Phase:  After the links are parsed, each one needs to be queried (HTTP GET) to see if the trackback RDF is embedded in the response stream of the URL ... the standard format looks like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description
    rdf:about="http://www.foo.com/archive.html#foo"
    dc:identifier="http://www.foo.com/archive.html#foo"
    dc:title="Foo Bar"
    trackback:ping="http://www.foo.com/tb.cgi/5" />
</rdf:RDF>
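Since that RDF block is buried inside arbitrary HTML, discovery mostly boils down to finding the trackback:ping attribute with another regex. A minimal sketch (the class name is illustrative; a fuller version would also match rdf:about against the link from step 1, since a single page can embed several RDF blocks):

```csharp
using System;
using System.Text.RegularExpressions;

class TrackbackDiscovery
{
    // Given the HTML returned by a GET on a linked page, pull out the
    // trackback:ping URL from the embedded RDF block, or null if the
    // page doesn't advertise trackback support.
    public static string FindPingUrl(string html)
    {
        Match m = Regex.Match(html,
            @"trackback:ping\s*=\s*""([^""]+)""",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```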

3.  The Ping Phase:  Once the trackback URLs are known (they're different from the actual link to the article), an HTTP POST is sent with the requisite information (URL, excerpt, title, and blog name).
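The ping itself is just a form-encoded POST of those four fields. A sketch using WebClient (the class and method names are mine, and error handling is omitted):

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class TrackbackPinger
{
    // The four fields the trackback spec defines for a ping.
    public static NameValueCollection BuildForm(string postUrl, string title,
                                                string excerpt, string blogName)
    {
        var form = new NameValueCollection();
        form.Add("url",       postUrl);
        form.Add("title",     title);
        form.Add("excerpt",   excerpt);
        form.Add("blog_name", blogName);
        return form;
    }

    // POST the form to the trackback:ping URL discovered in step 2 and
    // return the small XML response (error 0 means the ping was accepted).
    public static string Ping(string pingUrl, NameValueCollection form)
    {
        using (var client = new WebClient())
        {
            byte[] response = client.UploadValues(pingUrl, "POST", form);
            return Encoding.UTF8.GetString(response);
        }
    }
}
```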

4.  The Manual Phase:  Not all blog engines support this specification (mine didn't until today), so sometimes a trackback URL needs to be entered manually.  This is only needed in cases where the trackback URL is undiscoverable programmatically.

5.  The Receive Phase:  Receiving trackbacks has little in common with sending them.  Aside from receiving the URL, excerpt, title, and blog name from the POST, there's a lot of anti-spam functionality that can (and perhaps should) be built in.  For example, some engines re-ping the incoming URL and verify a link to the post or article exists before accepting the trackback.
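The receive side can be reduced to a function that takes the posted fields and produces the XML response the spec requires. A minimal sketch (names are illustrative; storage and the anti-spam re-ping are left as comments):

```csharp
using System;
using System.Collections.Generic;

class TrackbackReceiver
{
    // Given the POSTed form fields, decide whether to accept the ping and
    // build the spec's XML response (error 0 = accepted, error 1 = rejected).
    public static string HandlePing(IDictionary<string, string> form)
    {
        string url;
        if (!form.TryGetValue("url", out url) || string.IsNullOrEmpty(url))
            return "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
                   "<response><error>1</error>" +
                   "<message>Missing url</message></response>";

        // A real engine would store the trackback here, and could also
        // re-fetch `url` to verify it actually links back to the post
        // before accepting it.
        return "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
               "<response><error>0</error></response>";
    }
}
```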

Sounds simple, right?  It's not too bad -- but already much more complicated than getting a commenting engine going.  The first problem I see with virtually every blog engine out there: why aren't duplicate trackbacks filtered?  Is it that difficult to detect duplicates when an engine receives a ping?  Assuming a trackback hit is stored in the database, a composite key on the URL and article ID would take care of that in a hurry -- "I'm sorry, I already have a trackback for that URL."  In fact, that's exactly what I've done on my side.
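In the database this is just a unique composite key on (ArticleId, Url); the same idea in memory, as an illustration (a hypothetical store class, with a little URL normalization thrown in so trivially different URLs still collide):

```csharp
using System;
using System.Collections.Generic;

class TrackbackStore
{
    // Stand-in for the database's composite unique key on (ArticleId, Url).
    private readonly HashSet<string> seen = new HashSet<string>();

    // Returns false when a trackback for this article/URL pair already
    // exists -- "I'm sorry, I already have a trackback for that URL."
    public bool TryAdd(int articleId, string url)
    {
        string key = articleId + "|" + url.TrimEnd('/').ToLowerInvariant();
        return seen.Add(key);
    }
}
```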

Since auto-discovery will likely run each time the article is saved, any trackbacks already sent would be duplicated and show up again on the recipient's blog.  Trackbacks are typically sent only when the entry is first made public, but posts do occasionally get edited.  The client-side logic needed to track who has been pinged, what the response codes were, and so on, just to skirt a duplicate problem that should be handled on the recipient's end, is fairly intensive.  (And saving network bandwidth isn't that compelling a "why.")

The second issue I ran into was a new one, so I'm sharing it here because I'm sure someone will be searching the internet with this problem.  As part of the HTTP 1.1 protocol, the WebClient and WebRequest classes will send an Expect: 100-continue header for HTTP POSTs.  The logic makes sense: don't send a ton of data only to find out the server is rejecting the POST based on another header, or because the server issues a 302, for example.  Assuming the server is ready for the data, it responds, "Sure, continue..." and the data gets sent.

If the web server being posted to has no concept of this header (HTTP 1.0, for example -- though I'm certain I've seen this on some HTTP 1.1-capable servers as well), the result is what appears to be a timeout.  If you've experienced this problem, you've probably tried GETs just to confirm the server is reachable, then moved on to packet sniffing and searching the internet.  You'll be perplexed, because it works against many servers, while on some it simply appears to hang.

Of course, this could be any of a myriad of problems -- firewalls, routers, proxies, malformed requests, to name a few.  But because the Expect header is sent only for POSTs, a quick check of whether a GET request works is a strong indicator that this may be the issue.  There's no way to write data to the request stream with a GET, obviously, but it's an easy test.

So, unless you're posting large amounts of data, just turn off that pesky Expect header by setting this value before creating a WebRequest or WebClient object:

System.Net.ServicePointManager.Expect100Continue = false;
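That switch is global to the process. If that feels too heavy-handed, the same flag exists per endpoint via the request's ServicePoint; a sketch (the helper name is mine):

```csharp
using System;
using System.Net;

class NoExpectHeader
{
    // Build a POST request with the Expect: 100-continue header disabled
    // for just this endpoint, instead of flipping the global
    // ServicePointManager.Expect100Continue switch.
    public static HttpWebRequest CreatePost(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ServicePoint.Expect100Continue = false;
        return request;
    }
}
```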

Comments (1) -

Zloof
1/13/2006 10:04:41 AM #

i use
System.Net.ServicePointManager.Expect100Continue = False
        Dim client As New WebClient

and it does not work?(

