Weblogs: Semantic Web

OPML - the XML format with no friends

Saturday, October 01, 2005

Rogers Cadenhead sparks off an interesting conversation about OPML in his piece "OPML: Even worse than it appears". OPML directories have resurfaced. Perhaps this particular occurrence will last longer than the super open directory phase, and "The World Outline" that preceded it.

Recent history of OPML directories

There's nothing new going on here. Adam Curry gave OPML directories a massive boost of support - it's the cornerstone of the ipodder directory (now known as the indie podder directory). Where ipodder creates a directory page from an OPML script, the TechCrunch initiative whittles it down to a sidebar.

My interest in OPML

I've been interested in and worked with OPML for quite some time. In fact, my first half-decent Atom Publishing Format prototype - isoTope - accepted OPML as Atom content and rendered it as XHTML. (It also supported OML.) I've been tinkering with OPML as a means of building a directory, but keep running into some fundamental problems.

I've been experimenting with the OPML editor since the first version was publicly available. I like it as an organiser of my notes and thoughts, and these tend to end up as a structured post. In my experience, using it as a multi-post blog tool is like using a screwdriver as a hammer. Using an outline encourages structure - my blog posting style of longer one-per-page blog posts is complemented by the OPML editor.

OPML dichotomy

I think OPML suffers because of the many different uses it has. Outlining is a great way of structuring content - it's a natural approach to writing and editing. We're now seeing OPML's usefulness again in defining directories - something that was tried a while ago with a community-based Yahoo-like directory.

The same data format is being used for two different types of data that are visually and conceptually different. A directory is just a series of nested lists; an article outline is far meatier in content and words. But given an OPML file, it's difficult to figure out from the XML alone what type of data it actually describes: a directory or a structured article.

That to me is problem number 1. The attention to OPML directories distracts from OPML as a content structure.

OPML prefers plain text

Problem 2 is that OPML comfortably supports only plain text - particularly because it stores content in attributes.

HTML is handled by entity-escaping the markup. Trying to store HTML inside an attribute is already leading to entity encoding problems, similar to the entity encoding problems of RSS.
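
One way this kind of problem can show up is sketched below in Python. It is only an illustration, not code from any OPML tool: the helper names as_outline and read_back and both sample strings are made up, and the assumption is that a writer simply escapes whatever it is given into the text attribute.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

def as_outline(value):
    """Write a value into the text attribute of an <outline> element."""
    return "<outline text=%s/>" % quoteattr(value)

def read_back(xml):
    """Parse the element and return the unescaped attribute value."""
    return ET.fromstring(xml).get("text")

markup = "<b>bold</b>"         # meant to be rendered as HTML
literal = "Use <b> for bold"   # meant to be shown exactly as typed

for value in (markup, literal):
    stored = as_outline(value)
    print(stored, "->", read_back(stored))
```

Both values are escaped the same way on the way in and come back as ordinary strings containing "<b>". Nothing in the attribute says which one was meant as markup, so a consumer is left to guess.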

Ironically, XML is treated like a second-class citizen. Because attributes are the only container for content and information, it's impossible to have XML content inside an OPML document and still use the standard XML tools to interrogate and manipulate the content.

Instead of streaming one XML document through a parser and getting a reference to all the XML in the document in one go, the OPML way involves feeding the OPML document through the parser, and each time the parser encounters entity-escaped XML inside an attribute, a new parser instance has to be created to parse that content before a script can make any use of it.
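
The two-pass parsing described above can be sketched in Python with the standard library's ElementTree. This is a minimal illustration under assumptions: the OPML fragment and the function name emphasised_words are hypothetical, and the embedded content is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML fragment: the text attribute holds entity-escaped XHTML.
OPML = """<opml version="1.0">
  <head><title>Example</title></head>
  <body>
    <outline text="&lt;p&gt;Hello &lt;em&gt;world&lt;/em&gt;&lt;/p&gt;"/>
    <outline text="Just plain text"/>
  </body>
</opml>"""

def emphasised_words(opml_source):
    """Return the text of every <em> element hidden inside text attributes."""
    found = []
    outer = ET.fromstring(opml_source)        # first parse: the OPML itself
    for outline in outer.iter("outline"):
        escaped = outline.get("text", "")
        try:
            inner = ET.fromstring(escaped)    # second parse: one per attribute
        except ET.ParseError:
            continue                          # plain text, not markup
        found.extend(em.text for em in inner.iter("em"))
    return found

# The outer tree has no <em> element at all; it only has a string.
print(len(list(ET.fromstring(OPML).iter("em"))))
print(emphasised_words(OPML))
```

A search of the outer tree finds nothing, because to the first parser the XHTML is just attribute text. The markup only becomes visible once each attribute is handed to a parser of its own.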

OPML and the OPML editor

I've spent some hours looking through the UserTalk scripts that make up the OPML editor. I was surprised to see how little of the code actually has anything to do with OPML. Saving and opening a file is about all that uses OPML - each time it calls a kernel function (a non-UserTalk function) to translate the outline data structure to an OPML file.

OPML isn't even the data format used to transmit the data from the client to the server. The OPML Editor reads in the OPML file and creates a data structure. This data structure is passed to the XML-RPC functions, which create their own unorthodox format and send that to the server. At that point the server gives its script a data structure, which is then written to the file system as an OPML file.

The OPML editor highlights the main weakness of the OPML format - it's underspecified, and thus ambiguous. On the Community menu of the application there's a submenu called OPML conversions. This offers three menu items to convert Bloglines files, RSS Bandit files and SharpReader files into OPML. But all three input formats are OPML. So there are three separate methods to convert OPML into OPML. That is not a good sign.

XML and the OPML editor

It's disappointing that the OPML editor, powered by the Frontier platform, is devoid of any decent XML tools. I was shocked to see that the RSS feeds are created with nothing more than a series of print statements. I don't see DOM- or SAX-type methods being used to create XML documents. It's all string concatenation.

Reading in XML is a little better. The OPML Editor uses a generic function that converts an XML document into a nested series of table data structures that can be interrogated easily enough in scripts. This is similar to the first support Perl had for XML, before the real XML parsers arrived.
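
As a rough analogy only, here is what that kind of generic conversion might look like in Python, using nested dictionaries in place of Frontier tables. The function name to_table and the dictionary keys are invented for this sketch; it is not a port of the actual Frontier function.

```python
import xml.etree.ElementTree as ET

def to_table(element):
    """Convert an element into a nested dict of name, attributes, text and children."""
    return {
        "name": element.tag,
        "attributes": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [to_table(child) for child in element],
    }

# Hypothetical OPML input with one nested outline.
SOURCE = (
    '<opml version="1.0"><body>'
    '<outline text="Notes"><outline text="First idea"/></outline>'
    '</body></opml>'
)

table = to_table(ET.fromstring(SOURCE))

# Walk opml -> body -> first outline, then read its text attribute.
first_outline = table["children"][0]["children"][0]
print(first_outline["attributes"]["text"])
print(first_outline["children"][0]["attributes"]["text"])
```

A structure like this is easy to walk in a script, but it is a one-off snapshot: there are no queries, namespaces or streaming, which is roughly the gap between this approach and a real XML toolkit.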

Concatenating strings together is not a good way of doing XML-related development. I learnt that the hard way in prototyping bits and pieces for the Atom Syntax and Publishing formats.
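
A small Python sketch shows the sort of trouble string concatenation invites. The title value and the helper is_well_formed are made up for the example, and ElementTree stands in for whatever DOM-style tool is to hand.

```python
import xml.etree.ElementTree as ET

title = "Tom & Jerry <live>"   # hypothetical feed item title

# String concatenation: special characters pass through untouched.
concatenated = "<item><title>" + title + "</title></item>"

# Tree building: the serialiser escapes special characters on output.
item = ET.Element("item")
ET.SubElement(item, "title").text = title
built = ET.tostring(item, encoding="unicode")

def is_well_formed(xml):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False

print(concatenated, is_well_formed(concatenated))
print(built, is_well_formed(built))
```

The concatenated version breaks as soon as a title contains an ampersand or an angle bracket, while the tree-built version escapes them without the script having to think about it.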

Other arguments against OPML

There is a strong argument against OPML ever needing to exist in the first place. As a vocabulary for describing hierarchies, OPML is seen as redundant because it offers the same features XML already has. Hierarchy in XML is represented by parent-child element nesting.

The Wikipedia article on OPML raises an insightful point: Information about OPML items cannot itself be hierarchically marked up (ironically), due to the use of attributes to store that information.

Where to go from here

OPML in its current form has no future. It will be perpetually stuck shunting around plain text. An overhaul of OPML to allow interaction with other XML vocabularies will affect OPML's ability to move structured plain text around, and will most certainly break code that parses OPML. So that doesn't look to be a viable option.

The other alternative is a different XML format. OML, which is based on OPML, adds in structures to allow content to be put into elements instead of attributes. It would be useful to see the OPML editor supporting OML - that at least gives the format a chance of being equal in stature to Web 2.0 XML vocabularies.