Weblogs: Intelligent Agents
Shining and Polishing
Monday, September 16, 2002
The second implementation, which actively seeks out textual content, is done, and theRegister parses rather nicely. I had to put back the rule that drops paragraphs with fewer than three words, to remove the menu titles. Initially I also had a weighting against bold and italic paragraphs, which dropped the signoffs and author details on theRegister.
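As a rough illustration of the short-paragraph rule, here is a minimal Python sketch. It assumes the paragraphs have already been extracted as plain-text strings; the function name `drop_short_paragraphs` and the `min_words` parameter are made up for this example and are not taken from the actual parser.

```python
def drop_short_paragraphs(paragraphs, min_words=3):
    """Keep only paragraphs with at least `min_words` whitespace-separated words.

    Hypothetical helper: menu titles such as "Home" or "Site Map" tend to be
    one or two words long, so a small threshold filters most of them out.
    """
    return [p for p in paragraphs if len(p.split()) >= min_words]


sample = ["Home", "Site Map", "The parser now handles this page rather nicely."]
print(drop_short_paragraphs(sample))
```

The threshold is a blunt instrument: it would also drop a genuine two-word paragraph, which is a trade-off the rule accepts.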
Now just a few refinements to put in:
- Remove paragraphs that contain no lowercase text - should remove some silly menu text
- Remove paragraphs that hold only <small> enclosed text - this is probably footer wording
- Keep a list of open in-line elements and force them closed when the end of a paragraph is found - then the HTML output is valid
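The three refinements above could be sketched roughly as follows. This is only an illustration under stated assumptions: each paragraph is taken to be a string of HTML, tags are matched with simple regular expressions rather than a real HTML parser, and the names `has_lowercase`, `is_only_small`, `close_open_inline`, `refine` and the `INLINE_TAGS` set are all hypothetical.

```python
import re

# Assumed set of in-line elements worth tracking; adjust to taste.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "small", "span", "u", "font"}
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
SMALL_RE = re.compile(r"<small\b[^>]*>.*?</small>", re.IGNORECASE | re.DOTALL)


def strip_tags(fragment):
    """Remove anything that looks like a tag, leaving only the text."""
    return re.sub(r"<[^>]+>", "", fragment)


def has_lowercase(fragment):
    """True if the visible text contains at least one lowercase letter."""
    return any(ch.islower() for ch in strip_tags(fragment))


def is_only_small(fragment):
    """True if all visible text sits inside <small> elements."""
    if not SMALL_RE.search(fragment):
        return False
    remainder = strip_tags(SMALL_RE.sub("", fragment))
    return remainder.strip() == ""


def close_open_inline(fragment):
    """Append closing tags for in-line elements still open at paragraph end."""
    open_tags = []
    for match in TAG_RE.finditer(fragment):
        closing, name = match.group(1), match.group(2).lower()
        if name not in INLINE_TAGS:
            continue
        if closing:
            if name in open_tags:
                # Pop back to the matching opener, discarding anything
                # that was opened after it and never closed.
                while open_tags.pop() != name:
                    pass
        else:
            open_tags.append(name)
    return fragment + "".join(f"</{name}>" for name in reversed(open_tags))


def refine(paragraphs):
    """Apply the two removal rules, then force in-line elements closed."""
    kept = [p for p in paragraphs if has_lowercase(p) and not is_only_small(p)]
    return [close_open_inline(p) for p in kept]
```

For example, `close_open_inline("Some <b>bold and <i>italic text")` appends `</i></b>`, closing the innermost element first. A regex-based approach like this is fragile on messy markup, so a proper tokenizer would be the safer choice in a real parser.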