(Also: An experiment…)

In the last post, I made the unspeakable blargger mistake of linking to an article on a news site, which means that in 7 days or so, instead of said article, this link will return absolute crap and/or a “Buy membership now!” nag screen. Trying to keep up with such link rot (if anyone bothered) is a problem that grows linearly with the number of posts/articles written, until it consumes 100% of the blogger’s time and he/she/subject/verb has to stop writing posts and become a forest ranger. I’ve ranted about this before and floated some possible solutions, but as you may have guessed from my project completion record to date, I never got around to building any of them (I got maybe as far as writing a toy script that wgets pages and stuffs the contents into a database record).

So a little experiment: Instead of linking to the article directly, I linked to a carefully-constructed “I’m Feeling Lucky” Google query built from unique phrases in the article. The idea is that as the site shuffles stuff around / deletes content / recycles numeric links, the link should, rather than returning a 404*, preferentially return a clean copy of the article from somewhere else on the Internet if one exists (syndicated copy, fulltext copy-paste into a blog/slashdot post somewhere, etc.).

Let’s see if it lasts any longer than a regular news-site link!

(For anyone interested, the actual query is:

http://www.google.com/search?q=%22A+company’s+backroom+mass+of+servers+and+switches+is+cloudlike.+So+are+social-networking+sites+like+Facebook+Inc.%2C+or+the+act+of+buying+a+book+on+Amazon.+Some+clouds%2C+like+Google’s+email%22&btnI=Lucky

The “%22” at the beginning and end of the query string is the URL-safe percent-encoding of a double-quotation mark (ASCII code 0x22), used so that the quote marks in the query don’t conflict with the quote marks in the <a href="…"> tag. To simulate a click of the “I’m Feeling Lucky” button, replace the button-type code that normally appears in the query (btnG=Search) with “btnI=Lucky”. Also note that Google apparently limits queries to a maximum of 32 words.)
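For anyone who would rather not hand-encode these, here is a minimal Python sketch of the same construction. It assumes the q and btnI parameters behave as described above; the function name lucky_url and the constant MAX_QUERY_WORDS are made up for illustration, and the 32-word cutoff is just the apparent limit mentioned above, not a documented one.

```python
from urllib.parse import urlencode

# Apparent per-query word limit noted in the post (an observation, not a documented figure).
MAX_QUERY_WORDS = 32


def lucky_url(phrase: str) -> str:
    """Build an "I'm Feeling Lucky" search URL for an exact-phrase query.

    Hypothetical helper: trims the phrase to MAX_QUERY_WORDS words, wraps it
    in double quotes for an exact-phrase match, and percent-encodes it.
    """
    words = phrase.split()[:MAX_QUERY_WORDS]
    quoted = '"' + " ".join(words) + '"'
    # urlencode turns the double quotes into %22 and spaces into '+'.
    params = urlencode({"q": quoted, "btnI": "Lucky"})
    return "http://www.google.com/search?" + params


if __name__ == "__main__":
    print(lucky_url("A company's backroom mass of servers and switches is cloudlike."))
```

Letting urlencode do the escaping means the quote marks, commas, and any non-ASCII characters in the phrase come out URL-safe without having to remember the individual codes.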

* Modern commercial sites seldom, if ever, actually return an HTTP 404 code when a document is not found, since software, including search-engine spiders, detects these and drops 404’d pages from its listings. It’s far more profitable to pretend the user/bot has reached some kind of non-error document, swap in a generic landing page and stuff it full of keywords and advertising.
