
Breaking the Web with hash-bangs

Tuesday, February 08, 2011

Update 10 Feb 2011: Tim Bray has written a much shorter, clearer and less technical explanation of the broken use of hash-bang URLs. I thoroughly recommend reading and referencing it.

Update 11 Feb 2011: Another very insightful (and balanced) response, this one from Ben Ward (Hash, Bang, Wallop.), which does a great job of separating the wheat from the chaff.

Lifehacker, along with every other Gawker property, experienced a lengthy site outage on Monday because of a misbehaving piece of JavaScript. Gawker sites were reduced to an empty homepage layout with zero content, functionality, ads, or even legal disclaimer wording. Every visitor coming through via Google bounced right back out, because all the content was missing.

JavaScript dependent URLs

Gawker, like Twitter before it, built their new site to be totally dependent on JavaScript, even down to the page URLs. The JavaScript failed to load, so no content appeared, and every URL on the page was broken. In terms of site brittleness, Gawker’s new implementation got turned up to 11.

Every URL on Lifehacker now looks like this: http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker. Before Monday the URL was almost the same, but without the #!. So what?

Fragment identifiers

The # is a special character in a URL: it marks the rest of the URL as a fragment identifier, so everything after it refers to an HTML element id or a named anchor in the current page. The current page here being the Lifehacker homepage.

So on Sunday Lifehacker was a million-page site; today it's a one-page site with a million fragment identifiers.
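To make that concrete, here's a minimal sketch (plain browser JavaScript, using the Lifehacker URL from above) of how the browser splits such a URL, and why the server never sees the interesting part:

// A minimal sketch: how a browser splits a hash-bang URL. The fragment
// (everything from the # onwards) is never sent to the server.
var a = document.createElement("a");
a.href = "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker";

console.log(a.pathname); // "/"  <- the only path the server ever sees
console.log(a.hash);     // "#!5753509/hello-world-this-is-the-new-lifehacker"

// So the HTTP request for this "page" is just:
//   GET / HTTP/1.1
//   Host: lifehacker.com
// Only JavaScript running in the page can act on the fragment.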

Why? I don't know. Twitter's response when faced with this question on launching "New Twitter" was that Google can index individual tweets. True, but Google could do that with the previous, proper URL structure too, with much less overhead.

A solution to a problem

The #!-baked URL (hash-bang) syntax first came into the general web developer spotlight when Google announced a method web developers could use to allow Google to crawl Ajax-dependent websites.

Back then best practice web development wasn’t well known or appreciated, and sites using fancy technology like Ajax to bring in content found themselves not well listed or ranked for relevant keywords because Googlebot couldn’t find their content they’d hidden behind JavaScript calls.

Although Google spent many laborious hours trying to crack this problem, they eventually admitted defeat and tackled it from a different angle: instead of trying to find this mythical content, let's get website owners to tell us where the content actually is. They produced a specification aimed at doing just that.

In writing about it, Google were careful to stress that web developers should develop sites with progressive enhancement and not rely on JavaScript for its content, noting:

If you’re starting from scratch, one good approach is to build your site’s structure and navigation using only HTML. Then, once you have the site’s pages, links, and content in place, you can spice up the appearance and interface with Ajax. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your Ajax bonuses.

So the #! URL syntax was especially geared for sites that got the fundamental web development best practices horribly wrong, and gave them a lifeline to getting their content seen by Googlebot.

And today, this emergency rescue package seems to be regarded as the One True Way of web development by engineers from Facebook, Twitter, and now Lifehacker.

Clean URLs

In Google’s specification, they refer to the #!-patterned URLs as "pretty URLs", and they are transformed by Googlebot (and other crawlers supporting Google’s lifeline specification) into something more grotesque.

On Sunday, Lifehacker’s URL scheme looked like this:

http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker

Not bad. The 7-digit number in the middle is the only unclean thing about this URL, and Gawker’s content system needs that as a unique identifier to map to the actual article. So it’s a mostly clean URL.

Today, the same piece of content is now addressable via this URL:

http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker

This is less clean than before; the addition of the #! fundamentally changes the structure of the URL: the path component is now just /, and the article identifier has been shunted into the fragment, which the browser never sends to the server.

What does this achieve? Nothing. And the URL mangling doesn’t end there.

Google’s specification says that it will transform the hash-bang URL into a query string parameter, so the example URL above becomes:

http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker

That uglier URL actually returns the content of the article, so it is the canonical reference to this piece of content, and it is what Google indexes. (The same is true of Twitter’s hash-bang URLs.)
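The rewrite a supporting crawler performs is mechanical. Here's a simplified sketch of it (the real scheme also percent-encodes certain characters within the fragment value):

// A simplified sketch of the rewrite a crawler supporting Google's
// Ajax-crawling scheme performs on a hash-bang URL.
function toEscapedFragment(url) {
  var parts = url.split("#!");
  if (parts.length < 2) {
    return url; // no hash-bang, nothing to rewrite
  }
  var separator = parts[0].indexOf("?") === -1 ? "?" : "&";
  return parts[0] + separator + "_escaped_fragment_=" + parts[1];
}

toEscapedFragment("http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker");
// => "http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker"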

This URL scheme looks a lot like:

http://example.com/default.asp?page=about_us

Lifehacker/Gawker have thrown away a decade’s worth of clean URL experience, and ended up with something that actually looks worse than the typical templated Classic ASP site. (How much more FrontPage can you get?)

Clean? Not on your life!

What’s the problem?

The main problem is that Lifehacker URLs no longer map to actual content. Instead, every URL references the Lifehacker homepage. If you are lucky enough to have the JavaScript running successfully, the homepage then triggers off several Ajax requests to render the page, hopefully with the desired content showing up at some point.
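To illustrate just how many moving parts that involves, here's a rough sketch (the Ajax endpoint and element id are invented, not Gawker's actual code) of what a hash-bang "page load" has to do once the empty shell arrives:

// A rough sketch of the work a hash-bang site must do after the empty
// homepage shell has loaded. The Ajax endpoint and element id are invented.
window.onload = function () {
  var hash = window.location.hash;     // "#!5753509/hello-world-..."
  if (hash.indexOf("#!") !== 0) {
    return;                            // plain homepage, nothing more to fetch
  }
  var articlePath = hash.substring(2); // "5753509/hello-world-..."

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/ajax/article/" + articlePath, true); // hypothetical endpoint
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
      document.getElementById("content").innerHTML = xhr.responseText;
    }
    // Any failure along the way (script error, timeout, bad response)
    // leaves the visitor staring at the empty shell.
  };
  xhr.send(null);
};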

Far more complicated than a simple URL, far more error prone, and far more brittle.

So requesting the URL assigned to a piece of content doesn’t result in the requester receiving that content. It’s broken by design. Lifehacker is deliberately preventing crawlers from following links on the site towards interesting content, unless they jump through a hoop invented by Google.

Why is this hoop there?

The why of hash-bang

So why use a hash-bang if it’s an artificial URL, and a URL that needs to be reformatted before it points to a proper URL that actually returns content?

Out of all the reasons, the strongest one is “Because it’s cool”. I said strongest not strong.

Engineers will mutter something about preserving state within an Ajax application. Frankly, that’s a ridiculous reason for breaking URLs like this. The URL in an href can still be a proper, addressable reference to content: you are already using JavaScript, so you can do this damage much later, in a click handler on the link. The transform between last week’s Lifehacker URL scheme and this week’s hash-bang mangling is trivial to do there.

At the risk of invoking the wrath of Jamie Zawinski, Lifehacker can keep its mostly clean URL of last week (http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker) and obtain the mangled version with this regular expression:

var mangledUrl = this.href.replace(/\/(\d+)/, "/#!$1"); // turns /5753509/... into /#!5753509/...

Doing this mangling in JavaScript, during the click handler of the link, means you keep your apparent state benefits, but without needlessly preventing crawlers (or any other non-JavaScript user-agent) from traversing your site.
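As a rough sketch of that approach (the selector below is illustrative; the hrefs in the markup stay as real, crawlable URLs):

// A minimal sketch of the progressive-enhancement approach: the markup keeps
// the real URL, and the hash-bang only ever appears inside the click handler.
var links = document.querySelectorAll("a.article-link"); // illustrative selector
for (var i = 0; i < links.length; i++) {
  links[i].onclick = function () {
    // this.href is still http://lifehacker.com/5753509/hello-world-...
    var mangledUrl = this.href.replace(/\/(\d+)/, "/#!$1");
    window.location.href = mangledUrl; // hand over to the Ajax application
    return false;                      // cancel the normal page navigation
  };
}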

Disallow all bots (except Googlebot)

All non-browser user-agents (crawlers, aggregators, spiders, indexers) that completely support both HTTP/1.1 and the URL specification (RFC 2396, for example) cannot crawl any Lifehacker or Gawker content. Except Googlebot.

This has ramifications that need to be considered:

  1. Caching is now broken: since intermediary servers have no canonical representation of the content, they are unable to cache it. This results in Lifehacker being perceived as slower. It also means Gawker don’t save bandwidth costs through edge caching of chunks of content, and they are on their own in dealing with spikes of traffic.
  2. HTTP/1.1 and RFC-2396 compliant crawlers now cannot see anything but an empty homepage shell. This has knock-on effects on the applications and services built on such crawlers and indexers.
  3. The potential use of Microformats (and upper-case Semantic Web tools) has now dropped substantially: only browser-based aggregators or Google-led aggregators will see any Microformatted data. This removes Lifehacker and other Gawker sites from being used as data sources in hackdays (rather ironic, really).
  4. Facebook Like widgets that use page identifiers now need extra work to allow individual articles to be liked, since by default the homepage is the only page referenceable by a non-mangled URL, and all the mangled URLs resolve down to the homepage.

Being dependent on perfect JavaScript

If content cannot be retrieved from a server given its URL, then that site is broken. Gawker have deliberately made the decision to break these URLs, and they’ve left their site availability open to all sorts of JavaScript-related errors: a script failing to download, an error thrown by third-party ad code, a browser add-on blocking scripts.

Such brittleness, for no real reason or benefit that outweighs the downside. There are far better methods than the one Gawker adopted; even HTML5’s History API (with appropriate polyfillers) would be a better solution.

(If you thought that invalid XHTML delivered with the correct mimetype was not fit for the web, this JavaScript mangled-URLs approach is far worse)

An Architectural Nightmare

Gina Trapani tweets: "Lay down your pitchforks and give @Lifehacker’s redesign a week before you swear it off and insist that the staff doesn’t care about you." A week won’t solve Gawker’s architectural nightmare.

Gawker/Lifehacker have violated the principle of progressive enhancement, and they paid for it immediately with an extended outage on day one of their new site launch. Every JavaScript hiccup will cause an outage, and directly affect Gawker’s revenue stream and the trust of their audience.

Updates (9th February 2011)

Wow. I (and my VPS) am overwhelmed by the conversation this post has sparked. Thank you for contributing towards a constructive discussion. Some of the posts that caught my eye today:

All of the features that hash-bangs are providing can be done today in a safer, more web-friendly way with HTML5's pushState from the History API. (thanks Kerin Cosford & Dan Sanderson)
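For the curious, here's a bare-bones sketch of that pattern (loadArticle() and the selector are placeholders, not any particular site's code): the hrefs stay real URLs that work without JavaScript, and the History API is layered on top only where the browser supports it.

// A bare-bones sketch of progressively-enhanced pushState. loadArticle() and
// the selector are placeholders, not any particular site's code.
if (window.history && window.history.pushState) {
  var links = document.querySelectorAll("a.article-link"); // placeholder selector
  for (var i = 0; i < links.length; i++) {
    links[i].onclick = function () {
      history.pushState({}, "", this.href); // the address bar keeps the real URL
      loadArticle(this.href);               // placeholder: fetch and render via Ajax
      return false;
    };
  }

  // Handle back/forward navigation too:
  window.onpopstate = function () {
    loadArticle(window.location.href);
  };
}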

The Next Web reports that Gawker blogs have disappeared from Google News searches. A Gawker Media editor is quoted as saying they hope to have it resolved soon. They are listed again, but using the _escaped_fragment_ form of the URL. So much for clean URLs. Though the link seems intermittently broken, claiming the requested URL is not available (with a redirect to http://gawker.com/#ERR404).

I did like this tl;dr summary of this post over on theawl.com by mrmcd.

Webmonkey have a summary story, but link off to some very handy resources for clean URL strategies. (I first learnt HTML from Webmonkey back in the previous century)

Phillip Tellis, one of the handful of Yahoos I regret not meeting, blogs some Thoughts on performance, well worth reading. Also highly recommended is warpspire's URL Design.

Danny Thorpe talks about Side effects of hash-bang URLs, including URL Cache equivalence. Oliver Nightingale has a nicely worked example using HTML5's pushState in a progressively enhanced way (great job!)

The very short geeky summary of this post (try curling a Lifehacker article URL):


$ curl 'http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker' | grep "Hello"
$

or as Ben Ward put it: "If site content doesn't load through curl it's broken."

Broken HTTP Referrers

Watching my logfiles I'm seeing a number of inbound links to this post from gawker.com and kotaku.com, but only from the homepage (i.e. the fragment identifier has been stripped out). So somewhere on those sites there's a discussion going on about my post, and there's no way of finding it, thanks to Gawker's use of hash-bang URLs.
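That's inherent to how browsers build the Referer header (and document.referrer): the fragment identifier is never included. A quick sketch of what that means for anyone reading logs (the discussion URL below is a made-up placeholder):

// Browsers never include the fragment in the Referer header or in
// document.referrer. So a click from a hash-bang page such as
// http://gawker.com/#!<some-discussion-page> (hypothetical) arrives here as:
//
//   Referer: http://gawker.com/
//
// and on the destination page:
console.log(document.referrer); // "http://gawker.com/", fragment gone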

