Twelve Hours

Twelve hours.

Twelve hours and I still didn't find what was wrong.

I spent a good portion of last night and, well, this morning (didn't get to bed until 10:00 am) working on a project for a client. When you freelance … okay, when I freelance, I can lose track of time, and that's why I found myself working on a project on a Saturday night/Sunday morning.

The project itself isn't that hard. Data mining. Okay, nothing sexy like hacking a government site in sixty seconds with a gun to your head and getting a blowjob (Swordfish) [1] but hey, it's a living. And since it's pulling down pages from a webserver (it's public information by the way) it can't be that hard, right?

Right?

Twelve hours.

First off, the server I'm pulling from is a Microsoft IIS server and well … you have to be delusional if you think Microsoft follows standards to the letter. I already have to work around a few IIS bugs.

>
```
14.30 Location
The Location response-header field is used to redirect the recipient
to a location other than the Request-URI for completion of the
request or identification of a new resource. For 201 (Created)
responses, the Location is that of the new resource which was created
by the request. For 3xx responses, the location SHOULD indicate the
server's preferred URI for automatic redirection to the resource. The
field value consists of a single absolute URI.
Location = "Location" ":" absoluteURI
An example is:
Location: http://www.w3.org/pub/WWW/People.html
```

§14.30 of RFC-2616 (Hypertext Transfer Protocol---HTTP/1.1) [2]

Right there. Location: contains an absolute URI (Uniform Resource Identifier). But Microsoft? Nah, that would be like … following a standard or something, so when an IIS server sends out a Location: header, it's relative to the base URI the webserver was given. Well, I worked around that bug long ago, as well as the bug where IIS servers sometimes hand out two sets of headers.
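A workaround for relative Location: values can be sketched roughly like this. It is Python purely for illustration (the post never shows the original code), and both the function name `resolve_location` and the example.com URLs are made up for the example.

```python
from urllib.parse import urljoin


def resolve_location(request_url: str, location: str) -> str:
    """Turn a possibly relative Location value into an absolute URI.

    request_url is the URI that was just requested; it serves as the base.
    An already absolute Location value comes back unchanged.
    """
    return urljoin(request_url, location.strip())


if __name__ == "__main__":
    base = "http://www.example.com/records/search.asp"  # hypothetical URL
    # Relative to the directory of the request
    print(resolve_location(base, "results.asp?id=42"))
    # Relative to the server root
    print(resolve_location(base, "/license/accept.asp"))
    # Already absolute, as the RFC asks for
    print(resolve_location(base, "http://www.w3.org/pub/WWW/People.html"))
```

Resolving against the request URI means a server that does follow the RFC is handled by the same code path as one that doesn't.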

So that's a known quantity. This should be easy enough.

Twelve hours. It's become a mantra.

Now, even though the information is public (mandated by law no less) the owners of the site aren't going to make it easy to actually get to the information. Oh no. The whole site is framed in frames. Hit the wrong URL (Uniform Resource Locator) or neglect to send the correct Referer: header and you get bumped back to a frame.

Annoying, but having to deal with session tracking cookies is even worse. Attempt to avoid using cookies, and “Sorry, the site requires cookies.”

And you can't even get into the site until you click through their licence agreement.

Oh, did I mention this is public information I am pulling out?

I've never dealt with cookies before and well, there's a reason [3] why I never bothered before. Simple in theory but the devil is in the details.

I've been picking through the site with Lynx [4], figuring out which URLs I need to grab, which URLs I need as referring pages, and the minimum cookie support I need (since my own homegrown library doesn't exactly support cookies), and my code isn't working.
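For a sense of what "minimum cookie support" might look like, here is a small Python sketch. It is an assumption about the shape of such support, not the code from this project: the class name `MinimalCookieJar` and the cookie names are hypothetical, and it deliberately ignores domain, path, and expiry, which a real client can't.

```python
from http.cookies import SimpleCookie


class MinimalCookieJar:
    """Remember name=value pairs from Set-Cookie headers and echo them back.

    Assumption: a single site and a single session, so domain, path and
    expiry attributes are ignored entirely.
    """

    def __init__(self):
        self._cookies = {}

    def store(self, set_cookie_values):
        """Record the cookies found in a list of raw Set-Cookie header values."""
        for raw in set_cookie_values:
            parsed = SimpleCookie()
            parsed.load(raw)
            for name, morsel in parsed.items():
                self._cookies[name] = morsel.value

    def header(self):
        """Build the value for the Cookie: request header."""
        return "; ".join(f"{name}={value}" for name, value in self._cookies.items())


if __name__ == "__main__":
    jar = MinimalCookieJar()
    jar.store(["SESSIONID=abc123; path=/"])  # hypothetical session cookie
    print("Cookie: " + jar.header())
```

The jar only has to survive one crawl, which is why storing bare name/value pairs is enough for the sketch.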

I find out more where Microsoft's IIS is breaking the standard:

>
```
The action performed by the POST method might not result in a
resource that can be identified by a URI. In this case, either 200
(OK) or 204 (No Content) is the appropriate response status,
depending on whether or not the response includes an entity that
describes the result.
If a resource has been created on the origin server, the response
SHOULD be 201 (Created) and contain an entity which describes the
status of the request and refers to the new resource, and a Location
header (see section 14.30).
Responses to this method are not cacheable, unless the response
includes appropriate Cache-Control or Expires header fields. However,
the 303 (See Other) response can be used to direct the user agent to
retrieve a cacheable resource.
```

§9.5 of RFC-2616 (Hypertext Transfer Protocol---HTTP/1.1) [5]

Okay, so I guess Microsoft weasels out with the SHOULD clause there, because what it does is send out a 302 (Moved Temporarily), after which I immediately POST to the new location, where:

>
```
If the 302 status code is received in response to a request other
than GET or HEAD, the user agent MUST NOT automatically redirect the
request unless it can be confirmed by the user, since this might
change the conditions under which the request was issued.
Note: RFC 1945 [6] and RFC 2068 [7] specify that the client is not allowed
to change the method on the redirected request. However, most
existing user agent implementations treat 302 as if it were a 303
response, performing a GET on the Location field-value regardless
of the original request method. The status codes 303 and 307 have
been added for servers that wish to make unambiguously clear which
kind of reaction is expected of the client.
```

§10.3.3 of RFC-2616 (Hypertext Transfer Protocol---HTTP/1.1) [8]

You can't win coming or going. So in this case, not only is Microsoft IIS possibly in the wrong, but nearly every browser is too! Including the aforementioned Lynx. Although in my case, I don't change the method (frankly, it never occurred to me to do such a thing).
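The difference between what the RFC text quoted above says and what browsers do can be written down as a small decision function. This is a sketch in Python under stated assumptions: the name `next_method` and the `browser_like` flag are invented here, and only the three status codes mentioned in the quote (302, 303, 307) are covered.

```python
def next_method(status, method, browser_like=True):
    """Pick the method for the follow-up request after a redirect.

    Returns None when a strict reading of RFC 2616 says the client must
    not redirect automatically without asking the user first.
    """
    if status not in (302, 303, 307):
        raise ValueError(f"not a redirect handled by this sketch: {status}")
    if method in ("GET", "HEAD"):
        return method  # safe methods are simply repeated
    if status == 303:
        return "GET"  # 303 explicitly asks for a GET on the new location
    if status == 302 and browser_like:
        return "GET"  # what most user agents do: treat 302 like 303
    return None  # strict 302 or 307 on a POST: confirm with the user


if __name__ == "__main__":
    print(next_method(302, "POST"))                      # browser behaviour
    print(next_method(302, "POST", browser_like=False))  # strict reading
```

Note that neither branch re-sends the POST on its own, which is the third option described in the paragraph above.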

Twelve hours.

So I'm spending my time trying to figure out why my code doesn't work and yet Lynx does. I enable tracing in Lynx. It doesn't tell me anything that I don't already know. I'm adding headers. I'm mimicking headers.

Twelve hours.

At 10:00 am I give up and head to bed.

I get up and decide to record the actual traffic between my workstation and the server in question, to see exactly what is going on. So I record a session with Lynx, and with my software and look at the raw packets and see what is different between the two.
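Comparing two captured requests is easier when invisible characters are made visible. The following Python sketch assumes the raw request bytes have already been pulled out of the packet capture; the function name `diff_requests` and the sample requests are hypothetical.

```python
import difflib


def diff_requests(working: bytes, failing: bytes):
    """Return a unified diff of two raw HTTP requests.

    Each line is rendered with repr() so that CR, LF and other invisible
    bytes show up in the output instead of silently vanishing.
    """
    a = [repr(line) for line in working.splitlines(keepends=True)]
    b = [repr(line) for line in failing.splitlines(keepends=True)]
    return list(difflib.unified_diff(a, b, "working", "failing", lineterm=""))


if __name__ == "__main__":
    # Hypothetical captures, trimmed to a few lines each
    working = b"GET / HTTP/1.1\r\nHost: www.example.com\r\nCookie: a=b\r\n\r\n"
    failing = b"GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\nCookie: a=b\r\n\r\n\r\n"
    for line in diff_requests(working, failing):
        print(line)
```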

And that's when I want to slap myself up the head with a large and rather heavy blunt object.

Because it's a problem with my code. In fact, it was a feature of my code that I completely forgot about, seeing how I wrote the code in question back in 1997 (and the last server bug workaround code was added in 1999).

You see, when I was setting the headers to be sent with the request, I was including the characters CR (Carriage Return) and LF (Line Feed) at the end of each one (since that's part of the spec: header lines are separated by those characters), while the code I wrote back then already added the same characters to each header line as it was being sent out. So every header went out followed by a blank line, and in HTTP a blank line marks the end of the headers.

So no wonder it wasn't working.
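The effect is easy to reproduce. Below is a Python sketch, not the 1997 code: `build_request` stands in for a sender that appends CRLF to every line itself, `headers_seen_by_server` approximates how a server stops reading headers at the first blank line, and the URLs and header values are invented for the demonstration.

```python
CRLF = "\r\n"


def build_request(request_line, headers):
    """Serialize a request; the sender appends CRLF to every line itself."""
    lines = [request_line] + headers
    return "".join(line + CRLF for line in lines) + CRLF


def headers_seen_by_server(raw):
    """Return the header lines up to the first blank line."""
    head, _, _ = raw.partition(CRLF + CRLF)
    return head.split(CRLF)[1:]  # drop the request line


if __name__ == "__main__":
    request_line = "GET /records/search.asp HTTP/1.1"  # hypothetical
    headers = [
        "Host: www.example.com",
        "Referer: http://www.example.com/frames.asp",
        "Cookie: SESSIONID=abc123",
    ]
    good = build_request(request_line, headers)
    # The bug: the caller also terminates each header with CRLF
    bad = build_request(request_line, [h + CRLF for h in headers])
    print(headers_seen_by_server(good))
    print(headers_seen_by_server(bad))
    # One possible fix: strip any line ending the caller supplied
    fixed = build_request(request_line, [h.rstrip(CRLF) for h in bad.split(CRLF)[1:] if h])
    print(headers_seen_by_server(fixed))
```

In the doubled case only the first header survives as far as the server is concerned; everything after it, Referer: and Cookie: included, lands past the end of the header block.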

Twelve hours.

You can smack me now.

[1] http://us.imdb.com/Title?0244244

[2] http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2616.html

[3] http://home.netscape.com/newsref/std/cookie_spec.html

[4] http://lynx.browser.org/

[5] http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2616.html

[6] http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.html

[7] http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2068.html

[8] http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2616.html
