Weblogs: Semantic Web

NewsMonster irks Mark Pilgrim

Friday, February 21, 2003

Ben Hammersley has noticed the slightly toasted conversation going on about NewsMonster. In short, NewsMonster is a news aggregator that runs inside Mozilla and is pretty aggressive in retrieving HTML content for offline use, snagging URLs from RSS feeds and caching the pages for offline reading. The author points out that this is not a default option. This has infuriated Mark Pilgrim, since NewsMonster follows RSS links but doesn't read robots.txt directives (well, Mozilla doesn't read robots.txt directives) and goes ahead downloading the actual HTML and resources. I can understand Mark's frustration: his RSS feed contains each blog entry in its entirety (sans any comments), so this is just duplicate downloading, which wastes his bandwidth.
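For the curious, the check NewsMonster skips is simple enough to sketch. A minimal, hypothetical version of what a polite aggregator would do before pre-fetching an article URL found in an RSS item, using Python's standard robots.txt parser (the agent name and robots.txt content here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, robots_txt: str, agent: str = "NewsMonster") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# A hypothetical robots.txt barring aggressive aggregators from the archives:
robots = """User-agent: NewsMonster
Disallow: /archives/
"""

print(allowed_to_fetch("http://example.org/archives/2003/02/21.html", robots))  # False
print(allowed_to_fetch("http://example.org/index.html", robots))                # True
```

A well-behaved aggregator would fetch each site's /robots.txt once, cache it, and run a check like this before every HTML download.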

Where things go a little awry is the actual use of RSS, and it looks to be a difference of opinion over what RSS is and what it's used for. It seems that Mark believes RSS should be a Really Simple Syndication format, so he puts all his content into the feed. Kevin, NewsMonster's author, seems to believe that RSS is an RDF Site Summary, where entries are summaries of HTML pages, hence the pre-caching of interesting pages. Who is right? Probably both sides are.

Still, take any random RSS feed: how does an aggregator tell whether it's meant for syndication or for summary? I'd hazard a guess that any feed following Dave Winer's specification should be treated as syndication, while those following the RDF-style specifications are site summaries. In that case Mark is following the RDF Site Summary style, so his RSS feed is a summary of his website. Surely a summary means a short round-up of items, so if something looks interesting, you take the HTML link and read the article in its entirety. I guess my question is "What is a site summary, Mark?". Hopefully I'll get an answer I can use in building my little tool.
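My guess above can at least be mechanised: RDF Site Summary feeds (RSS 1.0) are rooted at an rdf:RDF element, while Dave Winer's formats (0.91/0.92/2.0) use a plain rss root. A rough sketch of that heuristic, with made-up feed fragments for illustration:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def feed_style(xml_text: str) -> str:
    """Guess the feed's intent from its root element."""
    root = ET.fromstring(xml_text)
    if root.tag == "{%s}RDF" % RDF_NS:
        return "summary"       # RSS 1.0: RDF Site Summary, entries point at full pages
    if root.tag == "rss":
        return "syndication"   # RSS 0.9x/2.0: the entries are the content
    return "unknown"

rss_10 = ('<rdf:RDF xmlns:rdf="%s" '
          'xmlns="http://purl.org/rss/1.0/"><channel rdf:about="x"/></rdf:RDF>' % RDF_NS)
rss_20 = '<rss version="2.0"><channel><title>t</title></channel></rss>'

print(feed_style(rss_10))  # summary
print(feed_style(rss_20))  # syndication
```

Of course this only tells you the format's lineage, not the author's actual intent, which is rather the point of the argument.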

Update: Michael Bernstein has given me an answer I can use, and it looks pretty reasonable and logical. Of course, if bloggers want to prevent automated news aggregators from retrieving their content, then "their readers will let them know soon enough". Michael is currently a regular poster and contributor to Mitch Kapor's Chandler project, and I've found his posts informative and well thought out. He has the skill of picking up a subject he doesn't fully know and wringing out all the knowledge there is to be found, so that when he does open his mouth, readers can be pretty sure he knows what he's talking about. A useful skill indeed.
