Weblogs: Intelligent Agents

Rewriting the Content Stripper

Thursday, September 26, 2002

Started last night on the rewrite of the Content Extractor (the little routine that takes a tag-soup HTML page, extracts all the content and discards all the soup) using HTML::TreeBuilder and HTML::Element. Apart from losing the end tags for list items it seems to work a treat. I kinda solved that by outputting the extracted elements as XML instead of HTML. That should be okay as long as I don't have to carry over br, hr and img elements. It took me a while to get my head around using $tree->look_down with an anonymous function to check whether an element contains any textual content.
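A minimal sketch of that kind of look_down check, not the actual script. The sample markup is made up, and "contains textual content" is assumed here to mean "has at least one direct text child that isn't just whitespace". It also shows the two ways of serialising what was found: as_XML, and as_HTML with an empty hash as its third argument so that no end tags are treated as optional.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical tag-soup sample; note the unclosed <li> elements.
my $html = '<table><tr><td><ul><li>First item<li>Second item</ul></td>'
         . '<td><img src="spacer.gif"></td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down calls the anonymous function with the element under test in $_[0]
# and returns every element for which the function returns true.
my @found = $tree->look_down(sub {
    my ($element) = @_;
    # Text children are plain strings; child elements are references.
    return scalar grep { !ref($_) && /\S/ } $element->content_list;
});

# Serialise before deleting the tree.
my @xml  = map { $_->as_XML } @found;                    # XML output closes every element
my @html = map { $_->as_HTML(undef, undef, {}) } @found; # HTML output, no end tags omitted

print @xml;
print "$_\n" for @html;

$tree->delete;   # explicit clean-up, needed with older HTML::Tree releases
```

With this sample the two list items are the only elements that match, which also shows the weakness of such a simple test: each li comes out on its own, without the ul around it.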

So far the new script pulls out the content, but doesn't leave the most valid of HTML. List items are pulled out without their ul or ol containers - that definitely needs to be corrected. I'll need to complicate my anonymous function to get that working smoothly; I'll probably have to traverse a few nodes down, looking "through" in-line elements for decent textual content.
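One way the more complicated anonymous function might look, as a sketch under assumptions. The list of in-line tags, the helper name has_text and the depth limit of 3 are all invented for illustration. The idea is to skip li and in-line elements themselves, accept a ul or ol when at least one of its items has text, and otherwise look through in-line children for text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed set of in-line tags to look "through"; adjust to taste.
my %inline = map { $_ => 1 } qw(a abbr b big cite code em font i small span strong tt u);

# Hypothetical helper: true if the element holds non-whitespace text, either
# directly or inside in-line children, descending at most $depth levels.
sub has_text {
    my ($element, $depth) = @_;
    $depth = 3 unless defined $depth;
    for my $child ($element->content_list) {
        if (!ref $child) {
            return 1 if $child =~ /\S/;
            next;
        }
        next unless $depth > 0 && $inline{ $child->tag };
        return 1 if has_text($child, $depth - 1);
    }
    return 0;
}

# Made-up sample: a list sitting in one cell of a layout table.
my $html = '<table><tr><td><ul><li><a href="/one"><b>First</b> item</a>'
         . '<li>Second item</ul></td><td><img src="spacer.gif"></td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @found = $tree->look_down(sub {
    my ($element) = @_;
    my $tag = $element->tag;
    # List items travel with their list, in-line tags with their parent.
    return 0 if $tag eq 'li' || $inline{$tag};
    if ($tag eq 'ul' || $tag eq 'ol') {
        return scalar grep { ref($_) && $_->tag eq 'li' && has_text($_) }
            $element->content_list;
    }
    return has_text($element);
});

my @xml = map { $_->as_XML } @found;   # serialise before deleting the tree
print @xml;

$tree->delete;
```

This still reports nested matches separately (a paragraph inside a list item would come out twice, once inside its list and once alone), so it is only a starting point.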

The other thought is that if we have a div with some text and a "layout" table, then the div is going to be collared as textual content, including the tag-soup layout table, so I'm going to have to reparse the table... unless I can adopt a similar rule of enclosing the found text into a paragraph and just ignoring the divs.
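The paragraph-wrapping idea could be sketched roughly like this. Everything here is an assumption: the function name extract_paragraphs, the in-line tag list, and the choice to copy p, ul and ol through unchanged. Runs of text and in-line elements are gathered into a fresh p, and every other element (div, table, tr, td and so on) is treated as a container to recurse into and then forget.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Assumed tag sets, for illustration only.
my %inline = map { $_ => 1 } qw(a abbr b big cite code em font i small span strong tt u);
my %keep   = map { $_ => 1 } qw(p ul ol);   # copied across as they are

# Hypothetical routine: returns a list of new elements holding the text found
# under $element, ignoring the div/table scaffolding around it.
sub extract_paragraphs {
    my ($element) = @_;
    my (@out, @run);

    # Turn the current run of text and in-line nodes into a new <p>.
    my $flush = sub {
        if (grep { ref($_) ? $_->as_text =~ /\S/ : /\S/ } @run) {
            my $p = HTML::Element->new('p');
            $p->push_content(map { ref($_) ? $_->clone : $_ } @run);
            push @out, $p;
        }
        @run = ();
    };

    for my $child ($element->content_list) {
        if (!ref($child) || $inline{ $child->tag }) {
            push @run, $child;
            next;
        }
        $flush->();
        if ($keep{ $child->tag }) {
            push @out, $child->clone if $child->as_text =~ /\S/;
        }
        else {
            push @out, extract_paragraphs($child);   # look inside, drop the wrapper
        }
    }
    $flush->();
    return @out;
}

# Made-up sample: a div with some text followed by a layout table.
my $html = '<div>Intro text with a <a href="/x">link</a>.'
         . '<table><tr><td><img src="nav.gif"></td><td>Cell text</td></tr></table></div>';

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @paras = extract_paragraphs($tree);
my @text  = map { $_->as_text } @paras;
print map { $_->as_XML } @paras;

$tree->delete;
```

The trade-off is that nothing needs reparsing, but any structure carried by the discarded containers is lost, and images are dropped along with the layout.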

