Mark Pilgrim's latest Dive Into XML column has been published. Mark details how the RSS Validator is architected to process RSS and identify problems even when the feed is malformed or invalid XML. (Mark promises something other than RSS next month.) A few comments came to mind while reading the article.
Mark never covered the topic of unneighborly RSS: RSS that is perfectly legal by the RSS spec (or lack thereof), but that causes logical errors, garbled display or, in some cases, makes receiving applications just choke. (Oh wait, I did that already.) My real wish is for the RSS Validator to provide warnings about these unneighborly practices along with tips to rectify the issues.
The system that Mark uses to parse a file is of interest to me and quite timely. In working to refactor the TikiText engine to be easier to extend and more efficient to run, I've been considering a similar approach. I originally based the code loosely on Text::WikiFormat and other Wiki implementations. Generally speaking, a series of individual regexes (regular expressions) is passed over the same string, each one rewriting the output of the one before. Order of operations becomes critical and sometimes, as I've found, conflicts are irreconcilable.
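To make the ordering problem concrete, here is a minimal sketch in Python rather than Perl. The `*bold*` and `/italic/` rules are hypothetical markup invented for illustration, not TikiText's actual syntax, and `render_in_passes` is an assumed helper name.

```python
import re

# Hypothetical wiki-style rules, invented for illustration only.
BOLD = (re.compile(r"\*(.+?)\*"), r"<b>\1</b>")
ITALIC = (re.compile(r"/(.+?)/"), r"<i>\1</i>")


def render_in_passes(text, rules):
    """Apply each regex in turn to the output of the previous pass."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


source = "*one* and *two*"

# Italic first: the source has no slashes, so only the bold rule fires.
italic_first = render_in_passes(source, [ITALIC, BOLD])

# Bold first: the italic pass then sees the slashes inside the freshly
# generated closing tags and treats them as italic delimiters.
bold_first = render_in_passes(source, [BOLD, ITALIC])

print(italic_first)
print(bold_first)
```

The same two rules give clean output in one order and mangled tags in the other, because the later pass cannot tell generated markup from source text.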
So while it works to a degree, it has become clear that this approach is flawed and that, in terms of processing, TikiText is not much different from XML. I've been restudying Robert Cameron's REX, a regular expression XML parser that is the basis of the XML::Parser::Lite module I know all about. I've always marveled at REX because, when expanded, it's the single longest and most complex regular expression I've viewed. It actually works quite well, assuming you don't hit your head on the limitations of Perl's regular expression handling of Unicode characters. The problems of my bad hair day were introduced during the implementation of REX into XML::Parser::Lite.
Currently I'm working on creating a couple of large regular expressions that create a stream of tokens that are passed (sometimes recursively) to various handlers and buffers. With all of my free time (ha. ha.) I hope to have something working by the end of the weekend.
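The rough shape of that idea can be sketched as follows, again in Python and again with hypothetical `*bold*` and `/italic/` markup standing in for the real TikiText rules. The names `TOKEN`, `tokenize`, `render` and `HANDLERS` are assumptions made for this sketch, not the actual design.

```python
import re
from html import escape

# One combined pattern: every character of the input is claimed by exactly
# one alternative, so the rules never see each other's output.
TOKEN = re.compile(r"""
    \*(?P<bold>[^*]+)\*      # *bold*
  | /(?P<italic>[^/]+)/      # /italic/
  | (?P<text>[^*/]+)         # run of plain text
  | (?P<stray>[*/])          # unmatched delimiter, kept literally
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, value) pairs in document order."""
    for match in TOKEN.finditer(source):
        kind = match.lastgroup
        yield kind, match.group(kind)


def render(source):
    """Dispatch each token to its handler and join the results."""
    return "".join(HANDLERS[kind](value) for kind, value in tokenize(source))


# Handlers for container tokens call render() again on their contents,
# which is where the recursion comes in.
HANDLERS = {
    "bold": lambda value: "<b>" + render(value) + "</b>",
    "italic": lambda value: "<i>" + render(value) + "</i>",
    "text": escape,
    "stray": escape,
}

print(render("*bold /nested/ text* and a < b / c"))
```

Because the source is scanned once and the generated tags never pass back through a pattern, the ordering conflict from the pass-by-pass approach cannot occur here; precedence is settled by the order of the alternatives inside the single expression.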
