Weblogs: Intelligent Agents

Comparing Trees

Friday, September 20, 2002

Over the past few days I've been reading up on Sean Burke's wonderful HTML::TreeBuilder package. Extended from HTML::Element and HTML::Parser, it allows an HTML document to be expressed as a tree structure of nodes. Treating the document as a tree has a neat advantage: outputting the tree results in normalised HTML, so we can treat an HTML page almost like an XML document.
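As a rough illustration of the parse-to-tree, print-back-normalised idea, here is a minimal sketch in Python using only the standard library's html.parser. It is a stand-in for the concept, not HTML::TreeBuilder's API: the names Node, TreeBuilder, parse and to_html are made up for this example, and the recovery rule (close everything back to the nearest matching open tag) is a simplifying assumption that is far cruder than what a real normaliser does.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, incomplete set).
VOID = {"br", "hr", "img", "input", "meta", "link"}

class Node:
    """One element in the tree; children are Node objects or text strings."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []

class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds a Node tree from tag soup."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text
            self.stack[-1].children.append(data)

def parse(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root

def to_html(node):
    """Serialise the tree: lowercase tags, quoted attributes, every tag closed."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    inner = "".join(to_html(child) for child in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{html.escape(value)}"'
        for name, value in sorted(node.attrs.items())
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"

soup = "<DIV CLASS=menu><P>Hello <B>world</DIV>"
print(to_html(parse(soup)))
# prints <div class="menu"><p>Hello <b>world</b></p></div>
```

Feeding the output back through parse and to_html gives the same string again, which is the property that makes the normalised form useful for comparing pages.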

Now, since the vast majority of HTML (in a tag-soup flavoured world) consists of common nested tables and common menus, it is possible to create a "contentless" tree - basically a template - and by comparing this tree with the original HTML page we can extract the differences. The majority of those differences would be the content itself - neat!
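The template comparison could be sketched along these lines, again in Python with only the standard library. Everything here is an assumption made for illustration: the nested [tag, children] representation, the names Builder, parse, text_of and extract_content, and above all the matching rule, which pairs each page element with the first unused template element of the same tag under the same parent. A real page would need something smarter than that, but it shows the shape of the idea.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}

class Builder(HTMLParser):
    """Build a nested [tag, children] tree; text nodes are plain strings."""
    def __init__(self):
        super().__init__()
        self.root = ["root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())

def parse(source):
    builder = Builder()
    builder.feed(source)
    builder.close()
    return builder.root

def text_of(node):
    """Flatten a subtree to its text."""
    if isinstance(node, str):
        return node
    return " ".join(text_of(child) for child in node[1])

def extract_content(template, page):
    """Return the text found in `page` but not in the contentless `template`."""
    found = []
    unused = list(template[1])  # template children not yet paired up
    for child in page[1]:
        if isinstance(child, str):
            if child in unused:
                unused.remove(child)  # boilerplate text shared with the template
            else:
                found.append(child)
            continue
        match = next(
            (t for t in unused if not isinstance(t, str) and t[0] == child[0]),
            None,
        )
        if match is None:
            text = text_of(child)  # whole subtree is missing from the template
            if text:
                found.append(text)
        else:
            unused.remove(match)
            found.extend(extract_content(match, child))
    return found

template = ("<table><tr><td><a>Home</a><a>About</a></td>"
            "<td></td></tr></table>")
page = ("<table><tr><td><a>Home</a><a>About</a></td>"
        "<td><h1>Comparing Trees</h1><p>Trees are neat.</p></td></tr></table>")

print(extract_content(parse(template), parse(page)))
# prints ['Comparing Trees', 'Trees are neat.']
```

The menu links match the template and drop out, and what is left over is the article text sitting in the cell the template leaves empty.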

This is probably a good point to switch to a proper HTML normaliser, now that I've got to grips with parsing HTML to extract content. Though it's a little tricky at this stage to pick out the proper language. There are pluses and minuses to all the languages; some are a matter of taste, and some are about the end-product itself. The languages I've got to choose from are:

Perl
Java
Python
PHP

Perl is the natural choice for the delivery stage, since almost any webserver runs it, so it's a nice language to write a demo in. The main selling points of Perl are that it's possible to write Apache modules using it, so overloading mod_proxy with intelligent agent logic is on the cards, and of course the sheer quantity of Perl modules. I'm running ActiveState Perl at the moment, just because the laptop I carry around with me is a Windows 98 based Thinkpad i1400. I've been so tempted to use the Thinkpad 600 running Red Hat 7.3, and would really do so if I could get the internal modem completely working (and I miss the soundcard *g*); then at least I could use it as a complete replacement for my i1400. Perl loses on the Windows front because of the limited selection of pre-compiled modules, and I'm too snobbish to buy Visual Studio 6.

Java is a natural choice for me, since it's the language of choice where I work. I've seen plenty of webservers written in Java, so writing a proxy filter ready to do intelligent agent stuff is just an extension. The drawback so far with Java is the lack of standardised regular expression handling (all hail Perl!), so parsing HTML is a bit of a pain. Nick has mentioned Necko as an HTML normaliser written in Java - it's mentioned on the apache.org website, so it must be of some use. Plus the Apache RegEx library could be an excellent external package for regular expressions. Of course, Java's XML ability is second to none, and the general feeling is that Java is an excellent platform for mobile agents. Hmmm, I'm starting to convince myself. The only drawback is that demoing stuff online is going to be a pain, unless I can collect it all up into a jarfile for distribution.

Python is a brilliant scripted OO language, close to Perl but with "proper" OO features. Its XML support is excellent, with 4Suite.org's package. It's a language I'm keen on learning, and I've got a ton of books waiting to be read (or cluttering up a shelf). The drawback: well, it's not going to be demoable on a website.

PHP is the least favourite of my favoured options. Great for demoing stuff online, but without the right packages pre-compiled on my host, it falls a little short. Plus its object orientation isn't the most intuitive. I do love PHP for web development, but I suppose I need to get serious about my intelligent agent ideas and start using the right tools that can do the job in the long run.

Well, it's kinda good that I've catalogued my thoughts like this; the more I think about Java, the more it sounds like the right way to go.
