Breaking the Web with hash-bangs
Tuesday, February 08, 2011
Update 10 Feb 2011: Tim Bray has written a much shorter, clearer and less technical explanation of the broken use of hash-bang URLs. I thoroughly recommend reading and referencing it.
Update 11 Feb 2011: Another very insightful (and balanced) response, this one from Ben Ward (Hash, Bang, Wallop.), who does a great job of separating the wheat from the chaff.
Every URL on Lifehacker now looks like this: http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker. Before Monday the URL was almost the same, but without the #!. So what?
# is a special character in a URL: it marks the rest of the URL as a fragment identifier, so everything after it refers to an HTML element id, or a named anchor, in the current page. The current page here being the Lifehacker homepage. Crucially, the fragment is for the browser's use only; it is never sent to the server.
So on Sunday Lifehacker was a one-million-page site; today it's a one-page site with a million fragment identifiers.
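A quick sketch of why that is, using the standard URL parser (the URL is the article's example): the fragment is stripped from the request before it ever leaves the browser, so the server only ever sees a request for /.

```javascript
// What the browser actually sends to the server for a hash-bang URL.
// Everything from "#" onwards stays client-side.
const hashBang = new URL("http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker");

console.log(hashBang.pathname); // "/"  -- the only path the server sees
console.log(hashBang.hash);     // "#!5753509/hello-world-this-is-the-new-lifehacker"

// The HTTP request line for this URL is therefore just:
console.log("GET " + hashBang.pathname + hashBang.search + " HTTP/1.1"); // "GET / HTTP/1.1"
```

Every one of the million article URLs produces the identical request line, which is why the whole site collapses to a single addressable page.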
Why? I don't know. Twitter's response when faced with this question on launching "New Twitter" was that Google can index individual tweets. True, but Google could do that with the previous, proper URL structure too, and with much less overhead.
A solution to a problem
The #! (hash-bang) URL syntax first came into the general web-developer spotlight when Google announced a method web developers could use to allow Google to crawl Ajax-dependent websites.
Although Google spent many laborious hours trying to crack this problem, they eventually admitted defeat and tackled it in a different manner. Instead of trying to find this mythical content themselves, they asked website owners to tell them where the content actually is, and they produced a specification aimed at doing just that.
If you’re starting from scratch, one good approach is to build your site’s structure and navigation using only HTML. Then, once you have the site’s pages, links, and content in place, you can spice up the appearance and interface with Ajax. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your Ajax bonuses.
The #! URL syntax was especially geared towards sites that got fundamental web-development best practices horribly wrong, and gave them a lifeline for getting their content seen by Googlebot.
And today, this emergency rescue package seems to be regarded as the One True Way of web development by engineers from Facebook, Twitter, and now Lifehacker.
Google's specification calls these #!-patterned URLs "pretty URLs", and they are transformed by Googlebot (and other crawlers supporting Google's lifeline specification) into something more grotesque.
On Sunday, Lifehacker's URL scheme looked like this:
http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker
Not bad. The seven-digit number in the middle is the only unclean thing about this URL, and Gawker's content system needs that as a unique identifier to map to the actual article. So it's a mostly clean URL.
Today, the same piece of content is addressable via this URL:
http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker
This is less clean than before; the addition of the #! fundamentally changes the structure of the URL:
- The path is now just /, the homepage.
- A new fragment identifier of !5753509/hello-world-this-is-the-new-lifehacker carries what used to be the path.
What does this achieve? Nothing. And the URL mangling doesn't end there.
Google's specification says that the crawler will transform the hash-bang URL into a query-string parameter, so the example URL above becomes:
http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker
That uglier URL actually returns the content of the article. So this is the canonical reference to this piece of content. This is the content that Google indexes. (This is also the same with Twitter's hash-bang URLs.)
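The rewriting a crawler has to perform can be sketched roughly like this (a simplified reading of Google's specification; here the whole fragment is percent-encoded with encodeURIComponent, which also escapes the slash):

```javascript
// Sketch: rewrite a "pretty" hash-bang URL into the _escaped_fragment_
// form that a supporting crawler actually requests from the server.
function toEscapedFragment(prettyUrl) {
  const url = new URL(prettyUrl);
  if (!url.hash.startsWith("#!")) return prettyUrl; // not a hash-bang URL
  // Everything after "#!" becomes the _escaped_fragment_ query value.
  const fragment = encodeURIComponent(url.hash.slice(2));
  url.hash = "";
  url.search = "_escaped_fragment_=" + fragment;
  return url.toString();
}

console.log(toEscapedFragment(
  "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"));
// http://lifehacker.com/?_escaped_fragment_=5753509%2Fhello-world-this-is-the-new-lifehacker
```

Note the asymmetry: browsers request one URL, supporting crawlers request a different one, and the server has to recognise and answer both.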
This URL scheme looks a lot like the query-string-driven URLs of an early-2000s templated site. Lifehacker/Gawker have thrown away a decade's worth of clean-URL experience and ended up with something that actually looks worse than the typical templated Classic ASP site. (How much more FrontPage can you get?)
Clean? Not on your life!
What’s the problem?
It is far more complicated than a simple URL, far more error-prone, and far more brittle.
So requesting the URL assigned to a piece of content doesn't result in the requester receiving that content. It's broken by design. Lifehacker is deliberately preventing crawlers from following links on the site towards interesting content, unless they jump through a hoop invented by Google.
Why is this hoop there?
The why of hash-bang
So why use a hash-bang if it’s an artificial URL, and a URL that needs to be reformatted before it points to a proper URL that actually returns content?
Out of all the reasons, the strongest one is "Because it's cool". I said strongest, not strong.
Engineers will mutter something about preserving state within an Ajax application. And frankly, that's a ridiculous reason for breaking URLs like that. The URL of an article should identify the article, not the internal state of the script that happens to display it.
At the risk of invoking the wrath of Jamie Zawinski, LifeHacker can keep its mostly clean URL of last week (
http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker) and obtain the mangled version by this regular expression:
var mangledUrl = this.href.replace(/(\d+)/, "#!$1");
Disallow all bots (except Googlebot)
All non-browser user-agents (crawlers, aggregators, spiders, indexers) that correctly implement both HTTP/1.1 and the URL specification (RFC 2396) can no longer crawl any Lifehacker or Gawker content. Except Googlebot.
This has ramifications that need to be considered:
- Caching is now broken. Since intermediary servers have no canonical representation of the content, they are unable to cache it. This makes Lifehacker feel slower, means Gawker save no bandwidth through edge caching of chunks of content, and leaves them on their own when dealing with traffic spikes.
- HTTP/1.1- and RFC 2396-compliant crawlers now see nothing but an empty homepage shell. This has knock-on effects on the applications and services built on such crawlers and indexers.
- The potential use of Microformats (and upper-case Semantic Web tools) has dropped substantially: only browser-based aggregators, or Google-led ones, will see any Microformatted data. This removes Lifehacker and other Gawker sites from use as data sources at hackdays (rather ironic, really).
- Facebook Like widgets that use page identifiers now need extra work to allow articles to be liked, since by default the homepage is the only page referenceable by a non-mangled URL, and all mangled URLs resolve down to the homepage.
- A debugging console.log line accidentally left in the source will cause Gawker's site to fail for visitors whose browsers don't have developer tools installed and enabled (Firefox, Safari, Internet Explorer): with no console object present, the stray call throws and halts the script, and on a hash-bang site that means no content at all.
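That last failure mode can be neutralised with a tiny defensive stub, loaded before any other script. A minimal sketch (the helper name is mine, not Gawker's):

```javascript
// Install a no-op console on a window-like object when none exists,
// so a stray console.log can no longer throw and halt every script.
function ensureConsole(win) {
  if (typeof win.console === "undefined") {
    win.console = {
      log: function () {},
      warn: function () {},
      error: function () {}
    };
  }
  return win.console;
}

// In a browser you would call ensureConsole(window) first thing;
// here a bare object stands in for a window without devtools.
var fakeOldBrowser = {};
ensureConsole(fakeOldBrowser);
fakeOldBrowser.console.log("safe"); // no longer throws
```

It doesn't fix the architecture, but it stops one forgotten debug line from taking the whole site down.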
Such brittleness, and for no real reason or benefit that outweighs the downside. There are far better methods than the one Gawker adopted; even HTML5's History API (with appropriate polyfills) would be a better solution.
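The History API approach can be sketched as progressive enhancement: links keep their clean, server-resolvable URLs; capable browsers intercept clicks, load the article over Ajax, and record the clean URL with pushState; everyone else just follows the link. (The a.article selector and #content container below are illustrative assumptions, not Gawker's actual markup.)

```javascript
// Does this window-like object support the HTML5 History API?
function supportsPushState(win) {
  return !!(win.history && typeof win.history.pushState === "function");
}

if (typeof document !== "undefined" && supportsPushState(window)) {
  document.addEventListener("click", function (event) {
    var link = event.target.closest("a.article"); // hypothetical article links
    if (!link) return;
    event.preventDefault();
    fetch(link.href) // the clean URL itself returns the article's HTML
      .then(function (response) { return response.text(); })
      .then(function (html) {
        document.querySelector("#content").innerHTML = html;
        history.pushState(null, "", link.href); // clean URL in the address bar
      });
  });
  // Back/forward: simplest recovery is to re-request the clean URL.
  window.addEventListener("popstate", function () { location.reload(); });
}
```

Every URL in this scheme resolves to real content for browsers, crawlers and curl alike; the Ajax layer is purely an optimisation on top.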
An Architectural Nightmare
Gina Trapani tweets:
Lay down your pitchforks and give @Lifehacker's redesign a week before you swear it off and insist that the staff doesn't care about you.
A week won't solve Gawker's architectural nightmare.
Updates (9th February 2011)
Wow. I (and my VPS) am overwhelmed by the conversation this post has sparked. Thank you for contributing towards a constructive discussion. Some of the posts that caught my eye today:
The Next Web reports that Gawker blogs have disappeared from Google News searches. A Gawker Media editor is quoted as saying they hope to have it resolved soon. They are listed again, but using the _escaped_fragment_ form of the URL. So much for clean URLs. Though the link seems intermittently broken, claiming the URL requested is not available (with a redirect to the homepage).
I did like this tl;dr summary of this post over on theawl.com by mrmcd.
Webmonkey have a summary story, but link off to some very handy resources for clean URL strategies. (I first learnt HTML from Webmonkey back in the previous century)
Danny Thorpe talks about Side effects of hash-bang URLs, including URL Cache equivalence. Oliver Nightingale has a nicely worked example using HTML5's pushState in a progressively enhanced way (great job!)
The very short geeky summary of this post (try curling a Lifehacker article's canonical URL):
$ curl http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker | grep "Hello"
$
or as Ben Ward put it:
If site content doesn't load through curl, it's broken.
Broken HTTP Referrers
Watching my logfiles, I'm seeing a number of inbound links to this post from gawker.com and kotaku.com, all from the homepage (i.e. the fragment identifier is stripped out). So somewhere on those sites there's a discussion going on about my post, and there's no way of finding it, thanks to Gawker's use of hash-bang URLs.