Weblogs: Intelligent Agents

Making Sense out of tag-soup

Sunday, September 08, 2002

One of the critical parts of an intelligent agent is the function that lets it look at a web page and extract all the meaningful bits out of it. With proper web development standards this should be a piece of cake, since the HTML would be properly structured and marked up, identifying titles, headers, paragraphs, citations, quotes and abbreviations. With pages authored in valid XHTML and presented with CSS, the job is easier still, because an ordinary XML parser can handle them.
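To illustrate: if a page really is valid XHTML, a generic XML parser does most of the work for you. A minimal sketch follows; the choice of XML::LibXML, the XPath query and reading the page from standard input are all just my assumptions for illustration:

    use strict;
    use warnings;
    use XML::LibXML;

    my $xhtml = do { local $/; <STDIN> };              # slurp a page from stdin
    my $doc   = XML::LibXML->new->parse_string($xhtml);

    # Valid XHTML is namespaced, so match on local names rather than bare tags.
    for my $node ($doc->findnodes(
            '//*[local-name()="h1" or local-name()="h2" or local-name()="p"]')) {
        printf "%-4s %s\n", $node->localname, $node->textContent;
    }

The titles, headers and paragraphs fall straight out of the tree, with no guesswork required.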

Reading the content off some of the larger websites (like CNN.com) is much more difficult because of the nested table layout. These designs have been infesting the World Wide Web for the last four years. They are mainly "two-browser" dependent (they display as intended in Netscape 4+ and Internet Explorer 4+, but hardly anywhere else). The content within these pages is buried deep in the HTML structure, and even then it isn't adequately marked up.

But this task needs to be done. With Tim Berners-Lee's conception of a Semantic Web (which includes deriving relationships between disparate web pages based on meaning and ideas, not just keywords) on the horizon, it will take a lot of well-structured markup before the value and benefits of this new "technology" (or concept) can be clearly demonstrated. So we need lots of well-structured (and valid) markup to populate the foundations of the World Wide Web (extended), which means convincing people to author and publish their articles in well-structured, well-defined markup. How do we do that? We can't offer them the Semantic Web yet, because there isn't enough structured markup around. It's a Catch-22: the Semantic Web requires structured markup, but people won't author structured markup until the Semantic Web arrives and they can clearly see the benefits it offers.

So in the interim, we are going to need a translation process that takes a tag-soup page and tries to extract the relevant content from it, trimming out all the presentation-infested markup. We can then restructure that content into valid markup, and the information becomes of some use. In effect, we are removing all presentation from the HTML to be left with clean and (hopefully) structured markup.
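Here's a rough sketch of what that translation could look like, using HTML::TreeBuilder, the usual tree-building companion to HTML::Parser and HTML::Element. Which tags get kept, which get thrown away, and the HTML::Entities escaping are all my own first guesses, not a fixed recipe:

    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use HTML::Entities qw(encode_entities);

    my $tree = HTML::TreeBuilder->new_from_content(do { local $/; <STDIN> });

    # Throw away the noisiest presentation-only elements outright.
    $_->delete for $tree->look_down(_tag => qr/^(script|style|font)$/);

    # Keep only the pieces we recognise as content, re-emitted in a clean shell.
    print "<html>\n<body>\n";
    for my $node ($tree->look_down(_tag => qr/^(h[1-6]|p|blockquote)$/)) {
        my $text = $node->as_text;
        next unless $text =~ /\S/;                 # skip empty layout leftovers
        printf "<%s>%s</%s>\n", $node->tag, encode_entities($text), $node->tag;
    }
    print "</body>\n</html>\n";

    $tree->delete;    # done with the tree (HTML::TreeBuilder recommends an explicit delete)

Fed a saved page on standard input, something like this should leave just the headings, paragraphs and quotes in a bare shell that a later step can restructure properly.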

This isn't a perfect science, and a large number of pages are simply impossible to infer any structure from, but we need a starting point to work from, some way of taking the information now and letting it be leveraged by standards-compliant tools. One of my optimistic aims is to be able to take a fixed-page, nested-table, tag-soup webpage, clean it up and represent it in a form that a small-screen device finds usable and accessible.
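One possible starting point for the nested-table case is a crude heuristic: the <td> holding the real article usually has far more plain text relative to its markup than the navigation and advertising cells around it, so scoring cells by text density might find it. This is only a guess at a rule of thumb, sketched below:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content(do { local $/; <STDIN> });

    my ($best, $best_score) = (undef, 0);
    for my $cell ($tree->look_down(_tag => 'td')) {
        my $text   = $cell->as_text;
        my $markup = $cell->as_HTML;
        next unless length $text and length $markup;
        # Text density, weighted by how much text there actually is, so a
        # one-word cell with no markup doesn't beat the real article.
        my $score = (length($text) / length($markup)) * length($text);
        ($best, $best_score) = ($cell, $score) if $score > $best_score;
    }

    print $best ? $best->as_text . "\n" : "no likely content cell found\n";
    $tree->delete;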

So the first script I want to hack up is one that takes a webpage, extracts its content into a "plain"-looking but structurally understood piece of markup, and then transforms that into a series of "cards" reminiscent of mobile-phone WML. I'm currently using HTML::Parser and HTML::Element.
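The card-splitting step of that script might look something like the sketch below: open a new card at every heading in the cleaned-up content and attach the paragraphs that follow it. The card markup printed here is only WML-flavoured illustration (no escaping, no deck wrapper), not a real WML deck:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content(do { local $/; <STDIN> });

    # Walk the content in document order, opening a new card at each heading.
    my @cards;
    for my $node ($tree->look_down(_tag => qr/^(h[1-3]|p)$/)) {
        if ($node->tag =~ /^h/) {
            push @cards, { title => $node->as_text, paras => [] };
        }
        elsif (@cards) {
            push @{ $cards[-1]{paras} }, $node->as_text;
        }
    }

    my $id = 0;
    for my $card (@cards) {
        printf qq{<card id="c%d" title="%s">\n}, ++$id, $card->{title};
        print "  <p>$_</p>\n" for @{ $card->{paras} };
        print "</card>\n";
    }
    $tree->delete;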

