The latest installment of Mark Pilgrim's Dive Into XML column is out. This month's topic is coping with invalid XML disguised as RSS feeds by parsing it without an XML parser. Before you get too excited, Mark concludes:
Hopefully we're trying to use a real XML parser first and only falling back on this messy regular expressions-based sgmllib parser when that fails. However, in flagrant abuse of all things pure and sacred, I have managed to extend this script into a full-fledged parse-at-all-costs RSS parser that supports all the advanced features of RSS, including namespaces. It even handles exotic variations of RSS 0.90 and 1.0, where everything is explicitly placed in a namespace (even the basic
title, link, and description tags). I don't recommend it, but it works for me.
Mark makes excellent observations as he presents his case and shows that he has the balls to write an article for XML.com demonstrating how to parse an XML-based format without an XML parser.
(His words from his weblog.)
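For the curious, the "real XML parser first, messy fallback second" approach Mark describes can be sketched in a few lines of Python. This is only my own rough illustration under stated assumptions, not Mark's script: the function name item_titles and both regular expressions are made up for this example, it uses the standard library's xml.etree.ElementTree rather than sgmllib, and it only pulls item titles.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately naive patterns for the fallback path.
# They tolerate an optional namespace prefix on <title> but nothing fancier.
ITEM_RE = re.compile(r"<item\b[^>]*>(.*?)</item\s*>", re.IGNORECASE | re.DOTALL)
TITLE_RE = re.compile(
    r"<(?:\w+:)?title\b[^>]*>(.*?)</(?:\w+:)?title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def _local(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def item_titles(feed_text):
    """Return (titles, used_fallback) for the items in a feed.

    A real XML parser is tried first; the regex path runs only when the
    document is not well-formed.
    """
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # Not well-formed: fall back to pattern matching on the raw text.
        titles = []
        for body in ITEM_RE.findall(feed_text):
            match = TITLE_RE.search(body)
            if match:
                titles.append(match.group(1).strip())
        return titles, True

    # Well-formed: walk the tree, ignoring namespaces so that RSS 0.90/1.0
    # feeds with everything in a namespace are handled like RSS 0.9x/2.0.
    titles = []
    for elem in root.iter():
        if _local(elem.tag) == "item":
            for child in elem:
                if _local(child.tag) == "title":
                    titles.append((child.text or "").strip())
                    break
    return titles, False
```

Returning the used_fallback flag alongside the titles is a choice I made for the sketch: it lets the caller log or report which feeds were malformed instead of quietly papering over them.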
At the same time, this article is more of the same news we RSS-aware people have already heard. I suppose it can't be repeated often enough.
Incidentally, it was Mark's initial observations on RSS and the release of his ultra-liberal RSS parser that led me into the foray with RSS that still curses me to this day.
I will still foolishly continue to advocate well-formedness and hope for the day when only 1% of feeds are malformed.
