Weblogs: Intelligent Agents

Parsing tag soup

Monday, September 16, 2002

The first version of my HTML parsing algorithm is complete. With a few tweaks the yahoo, amazon, cnn, bbc, stattoshop, ananova and microsoft test pages all came through okay, but The Register came up blank. My algorithm identifies content as text preceded by h1-h6, title or paragraph tags, which worked to some extent (even on pages that left out the closing tags), but The Register has no paragraph elements at all, just text and <br>s.
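Roughly, the whitelist idea looks something like this (just a sketch, not the actual implementation, since the post doesn't show the code; the file name page.html and the exact tag list are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Elements whose enclosed text we treat as real content (assumed list).
    my %content_tag = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6 p);

    my @blocks;
    my $in_content = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ( $content_tag{$tag} ) {
                $in_content = 1;
                push @blocks, '';          # start a fresh block of text
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_content = 0 if $content_tag{$tag};
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $blocks[-1] .= $text if $in_content && @blocks;
        }, 'dtext' ],
    );

    $parser->parse_file('page.html') or die "can't parse: $!";
    print "$_\n\n" for grep { /\S/ } @blocks;

Pages like The Register fall straight through this: with no p or heading elements, $in_content never gets set and nothing is collected.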

So I've started down a different track to see if I can extract decent content from pages typified by The Register: running the HTML through a SAX-like HTML parser (HTML::Parser) and catching all the text events. My previous algorithm was based around a list of valid elements containing text; the newer approach is to treat all text events as content, except text inside elements such as script. In both algorithms I'm ignoring tables and treating a lot of elements as pseudo-paragraphs. So far the approach seems feasible.
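A sketch of that second approach with HTML::Parser: catch every text event, skip anything inside a script element or a table, and start a new pseudo-paragraph on block-level tags. The break-tag list, the inclusion of style alongside script, and the file name theregister.html are assumptions for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Elements whose text we never want (assumed list).
    my %ignore_tag = map { $_ => 1 } qw(script style);
    # Elements treated as pseudo-paragraphs: each one starts a new block.
    my %break_tag  = map { $_ => 1 } qw(p br div h1 h2 h3 h4 h5 h6 li);

    my @blocks = ('');
    my $table_depth = 0;   # anything inside a table is ignored
    my $skip        = 0;   # currently inside script/style

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $table_depth++   if $tag eq 'table';
            $skip = 1        if $ignore_tag{$tag};
            push @blocks, '' if $break_tag{$tag};   # new pseudo-paragraph
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $table_depth-- if $tag eq 'table' && $table_depth > 0;
            $skip = 0      if $ignore_tag{$tag};
            push @blocks, '' if $break_tag{$tag};
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $blocks[-1] .= $text unless $skip || $table_depth;
        }, 'dtext' ],
    );

    $parser->parse_file('theregister.html') or die "can't parse: $!";
    my @content = grep { /\S/ } @blocks;

Because <br> counts as a break tag, a page built from bare text and <br>s still falls out as separate blocks, which is exactly the case the first algorithm missed.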

To differentiate link lists from real text, I count the number of linked and unlinked words in each block. If the number of linked words is comfortably greater, I discard the block, since it is most likely a menu of some sort. Seems to work quite well.
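The linked-versus-unlinked count can be done per block on the raw HTML, along these lines (a rough sketch: the helper name looks_like_menu, the regex-based tag stripping, and a plain majority standing in for "comfortably greater" are all assumptions):

    # Decide whether one block's raw HTML looks like a menu:
    # words inside <a>...</a> vs words outside.
    sub looks_like_menu {
        my ($html) = @_;
        my ($linked, $unlinked) = (0, 0);

        my $copy = $html;
        # Pull out and count the anchor text, removing it as we go.
        while ( $copy =~ s/<a\b[^>]*>(.*?)<\/a>//si ) {
            my $anchor = $1;
            $anchor =~ s/<[^>]+>//g;
            $linked += () = $anchor =~ /\S+/g;
        }
        # Whatever is left (minus tags) is unlinked text.
        $copy =~ s/<[^>]+>//g;
        $unlinked = () = $copy =~ /\S+/g;

        return $linked > $unlinked;
    }

    # A navigation list comes out as a menu, a sentence with one link does not:
    print looks_like_menu('<li><a href="/news">News</a> <a href="/sport">Sport</a></li>')
        ? "menu\n" : "content\n";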

I'm hoping to finish the second implementation in the next few days... At least I'll have a function that can strip the majority of the tag soup out and convert the content into blocks of text.
