February 2003 Archives

Text Processing Innards.

| No Comments

Mark Pilgrim's latest Dive Into XML column has been published. Mark details how the RSS Validator is architected to process RSS and identify problems even if it's malformed, invalid XML. (Mark promises something other than RSS next month.) Reading this article, a few comments came to mind.

Mark never covered the topic of unneighborly RSS – RSS that is perfectly legal by the RSS spec (or lack thereof), but causes logical errors, garbled display or, in some cases, causes receiving applications to just choke. (Oh wait, I did that already.) My real wish is for the RSS Validator to provide warnings about these unneighborly practices along with tips to rectify the issues.

The system that Mark uses to parse a file is of interest to me currently and quite timely. In working to refactor the TikiText engine to be easier to extend and more efficient to run, I've actually been considering a similar approach. I originally based the code loosely on Text::WikiFormat and other Wiki implementations. Generally speaking, a series of individual regexes (regular expressions) is passed over the same string. Order of operations becomes critical and sometimes, as I've found, conflicts are irreconcilable.
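To make the ordering problem concrete, here is a minimal sketch in Python (not the Perl of Text::Tiki). The three rules are hypothetical stand-ins rather than actual TikiText or Text::WikiFormat notation. The first pair of rules only works in one order; the second pair mangles a URL containing underscores whichever order is used.

```python
import re

# Hypothetical rules for illustration; not TikiText's actual notation.
ESCAPE = (re.compile(r"<"), "&lt;")
EMPHASIS = (re.compile(r"_(\S[^_]*?)_"), r"<em>\1</em>")
LINK = (re.compile(r"(https?://\S+)"), r'<a href="\1">\1</a>')

def format_text(text, rules):
    """Pass each regex over the same string, one after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

# Order matters: escaping after emphasis destroys the tags just produced.
sample = "Use _care_ when 1 < 2."
escape_first = format_text(sample, [ESCAPE, EMPHASIS])
emphasis_first = format_text(sample, [EMPHASIS, ESCAPE])

# No order helps here: the underscores inside the URL are always
# picked up by the emphasis rule.
url_sample = "See http://example.com/my_long_page for _details_."
links_first = format_text(url_sample, [LINK, EMPHASIS])
links_last = format_text(url_sample, [EMPHASIS, LINK])
```

Here `escape_first` comes out as intended, `emphasis_first` ends up with its own tags escaped, and both `links_first` and `links_last` contain a broken `href`. Each rule sees only a flat string and has no idea what an earlier rule already claimed.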

So while it works to a degree, it has become clear that this approach is flawed and that, in terms of processing, TikiText is not much different than XML. I've been restudying Robert Cameron's REX, a regular expression XML parser, that is the basis of the XML::Parser::Lite module I know all about. I've always marveled at REX because, when expanded, it's the single longest and most complex regular expression I've viewed. It actually works quite well assuming you don't hit your head on the limitations of Perl's regular expression handling of Unicode characters. The problems of my bad hair day were introduced during the implementation of REX into XML::Parser::Lite.

Currently I'm working on creating a couple of large regular expressions that create a stream of tokens that are passed (sometimes recursively) to various handlers and buffers. With all of my free time (ha. ha.) I hope to have something working by the end of the weekend.
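As a rough illustration of that token-stream idea, here is a small Python sketch. It is not the Text::Tiki code: the token grammar (URLs, `_emphasis_`, plain text) and the handler names are assumptions made up for the example. One combined expression scans the text once, and each token goes to the handler registered for its type.

```python
import html
import re

# Hypothetical token grammar: URLs, _emphasis_, and everything else as text.
TOKEN_RE = re.compile(
    r"(?P<url>https?://\S+)"        # tried first, so URLs win over emphasis
    r"|(?P<em>_[^_\s][^_]*_)"       # _emphasized text_
    r"|(?P<text>[^_h]+|.)",         # a run of plain text, or one stray character
    re.DOTALL,
)

def handle_url(value):
    return f'<a href="{html.escape(value)}">{html.escape(value, quote=False)}</a>'

def handle_em(value):
    # Recurse so anything nested inside the emphasis is tokenized as well.
    return f"<em>{render(value[1:-1])}</em>"

def handle_text(value):
    return html.escape(value, quote=False)

HANDLERS = {"url": handle_url, "em": handle_em, "text": handle_text}

def render(source):
    """Scan the source once and hand each token to the handler for its type."""
    out = []
    for match in TOKEN_RE.finditer(source):
        out.append(HANDLERS[match.lastgroup](match.group()))
    return "".join(out)

result = render("See http://example.com/my_long_page for _details_ & more.")
```

Because the URL is consumed as a single token before the emphasis pattern ever sees it, the underscores in `my_long_page` are left alone. A pile of independent passes cannot guarantee that.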

Also on Boing Boing today, Xeni Jardin notes that "Aside from being a decent and compassionate human being, Fred Rogers was also a champion of fair use." From the website of the Home Recording Rights Coalition: (My emphasis.)

In [the Sony Betamax] ruling that home time-shift recording of television programming for private use was not copyright infringement, the Supreme Court relied on testimony from television producers who did not object to such home recording. One of the most prominent witnesses on this issue was Fred Rogers.

The Supreme Court wrote: Second is the testimony of Fred Rogers, president of the corporation that produces and owns the copyright on Mister Rogers' Neighborhood. The program is carried by more public television stations than any other program. Its audience numbers over 3,000,000 families a day. He testified that he had absolutely no objection to home taping for noncommercial use and expressed the opinion that it is a real service to families to be able to record children's programs and to show them at appropriate times.

(Excerpt from Mr. Rogers' trial testimony: ) Some public stations, as well as commercial stations, program the 'Neighborhood' at hours when some children cannot use it. . . . I have always felt that with the advent of all of this new technology that allows people to tape the 'Neighborhood' off-the-air, and I'm speaking for the 'Neighborhood' because that's what I produce, that they then become much more active in the programming of their family's television life. Very frankly, I am opposed to people being programmed by others. My whole approach in broadcasting has always been 'You are an important person just the way you are. You can make healthy decisions.' Maybe I'm going on too long, but I just feel that anything that allows a person to be more active in the control of his or her life, in a healthy way, is important.

I actually began to tear up when I saw his obituary yesterday in the New York Times, much like I did when I read Charles Schulz's. I guess having watched hundreds of his shows growing up and now once again with my daughter, it feels like the death of a family friend. I suppose in some ways it is. Reading his testimony on this matter, I have a new appreciation for what a truly special guy he was.

Relating back to yesterday's mention of Interwoven's absurd patent claim on web site versioning, Boing Boing points us to this Wired interview with Ralph Nader where he says:

Name one genius inventor who has gotten rich from a software patent. There must be some, but the system mostly benefits a handful of businesspeople and lawyers who don't write code. Look at British Telecom. It took years before BT's patent lawyers discovered the company had invented hypertext linking. Now General Electric claims it invented the JPEG file format. If GE is so smart, why did it take so many years to figure out it invented such a popular technology? Which genius inventors get rich on such claims?

CVS2RSS: This is, of course, what RSS was invented for: Kellan's CVS 2 RSS - it generates an RSS feed of CVS checkins. (via Ben Hammersley via Jeremy Allaire's Weblog)

This is neat and a great example of an RSS-based Web service. I'll note that Jon Udell isn't the only one who has been speculating about RSS' expanding role beyond content syndication and into the broader space of Web services. Joe Gregorio, DJ Adams and I are among those who have also been speculating for some time. Interestingly this script uses the RDF-happy RSS 1.0 rather than RSS 2.0. RSS 2.0 is of course not compatible with 1.0, but could be with 0.91 if you choose… (Shouldn't we have resolved this by now?)

Then again this all could be moot since Interwoven has been granted a patent for using versioning systems to create web sites (via SixLog). This is complete rubbish and yet another absurd patent awarded. As Ben Trott points out, many a web site developer/publisher has been using CVS for just this purpose.

Perhaps the US Patent Office would benefit from weblogging these things before they grant a claim and create another potential legal mess.

TikiText Update.

| No Comments

First, I'm enthusiastic and honored that there has been a good bit of interest. I appreciate all of the feedback thus far. Please keep it coming. Also of note…

Rael Dornfest released a beta of version 2 of his weblogging tool Blosxom (he's up to Beta 3 now) that adds plugin capabilities. One of the example plugins included is a TikiText plugin. Excellent!

Rael also has been thinking about the crossover between Wikis and weblogs and is experimenting with my TikiText module. (An interesting thought. More on this later.)

DJ Adams has fused Text::Tiki into his favorite Wiki tool, MoinMoin, concluding the 'natural environment' for a wiki-like markup language is ... in a Wiki. Very interesting. It certainly puts a few thoughts in my head to ponder.

Since TikiText has gone into regular use with some, I'm releasing a maintenance release to plug a few holes that have been discovered in the TikiText processing engine. From the change log:

  • fixed bug where formatting immediately in () and {} was not recognized.
  • fixed improper coding of anchor targets. (Paul Holbrook)

Download: .ZIP | .GZ | Reference Guide 0.17

I've been at work on the next major release of the engine with whatever bits of free time I salvage from my day. (I'm beginning to sound like Rael.) Here are a few highlights. First the bad news.

The symbol for CODE blocks will be changing.

I had a mental lapse here and should not have used the | (pipe) character since I knew table support was needed and the pipe is almost universally applied to that. Since I did say this was experimental I only feel a little bit bad about this. So let me clarify previous statements: this module/implementation is in beta, but the syntax is an alpha. (Does that make sense?)

I think the greater good is for Tiki to support tables that are similar to existing conventions rather than do something totally different and counter-intuitive. Hopefully you agree.

Lists in <blockquote>s. Nested <blockquote>s. etc.

I'm working on substantially refactoring the code to support nesting of elements better. I'm pretty sure the code could shrink a bit and some minor performance gains could be achieved. We'll see.

Implement <image>

alt-text will be used to insert an image. The engine will automatically calculate the height and width.

Implement <tables>

Basic Wiki-style tables will be implemented in some form. The table will be of a similar style to Twiki; however, I'm considering using a symbol just inside the cell, denoted by a | (pipe), so that I can ignore leading and trailing whitespace in each cell. My reason is that authors should have the option to make tables readable without being parsed – assuming they are using a fixed-width font.
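A minimal sketch of the whitespace-tolerant cell handling, written in Python rather than the module's Perl. It assumes a plain TWiki-like row that starts and ends with a pipe; since the final TikiText table syntax is still undecided, the function name and the exact rules here are hypothetical.

```python
def parse_table_row(line):
    """Split a pipe-delimited row into cells, ignoring padding in each cell.

    Returns None when the line is not a table row.
    """
    stripped = line.strip()
    if len(stripped) < 2 or not (stripped.startswith("|") and stripped.endswith("|")):
        return None
    # Drop the outer pipes, split on the inner ones, and trim every cell.
    return [cell.strip() for cell in stripped[1:-1].split("|")]

# Padding lets the author line columns up in a fixed-width font.
rows = [
    parse_table_row("| Notation | Purpose        |"),
    parse_table_row("| TikiText | Plain text in  |"),
]
```

Since the padding is thrown away, the source can be aligned however the author likes and the parsed cells come out the same.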

Implement callback for WikiWords

While TikiText undeniably descends from Wiki notation, it was never my intention to replace it or have TikiText put into that use. There seems to be interest based on Rael and DJ's experience. Being one to please, I will be adding a callback hook to Text::Tiki for those who are implementing this module in ways where this would be useful. WikiWord support will be completely optional. A WikiWord link is an implementation issue that varies and in some cases may not be warranted at all.
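The idea of an optional callback can be sketched as follows. This is Python for illustration only: the real hook in Text::Tiki is not yet written, so the function name, the WikiWord pattern, and the callback signature are all assumptions.

```python
import re

# A common WikiWord definition: two or more capitalized runs joined together.
WIKIWORD_RE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")

def link_wikiwords(text, callback=None):
    """Replace each WikiWord with whatever the caller's callback returns.

    With no callback the text is returned untouched, keeping WikiWord
    support entirely optional.
    """
    if callback is None:
        return text
    return WIKIWORD_RE.sub(lambda match: callback(match.group()), text)

def my_wiki_link(word):
    # How a link is built is the host application's decision.
    return f'<a href="/wiki/{word}">{word}</a>'

linked = link_wikiwords("See FrontPage and TikiText.", my_wiki_link)
untouched = link_wikiwords("See FrontPage and TikiText.")
```

The formatter only needs to recognize the word. Everything about where the link points stays with the application that supplies the callback.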

I've updated the TO DO section of the documentation accordingly. The TikiText Forum is still open here; your comments and suggestions are welcome.

I just learned that no one has been able to post any comments to the TikiText beta forum I set up. I failed to select them as open before posting. (How embarrassing!) Fixed. My apologies to Todd Larason and anyone else I inconvenienced. I really want feedback – constructive criticism and feature requests.

Times are Tough in NYC.

| No Comments

A former colleague of mine sent me a link to an article Economy Is Tough All Over, but in New York, It's Horrid that is running in the New York Times today. (Registration required.) What is most amusing to me is the picture featured with the article. That would be my former employer's logo on the wall way in the back and yes, that is my former employee cafeteria they are auctioning off.

On a more serious note, the article goes on to detail some astounding statistics on how much harder New York City has been hit by this economic downturn. These statistics just reinforce what I said some time back about how incredibly gloomy Silicon Alley has become. I've considered the entrepreneurial route myself, but who would bother investing? There is no excitement. No new ventures. No pulse. The only exciting thing I've heard going on in town is some of Nick Denton's ventures – Gizmodo, Gawker, an unnamed porn blog and The Lafayette Project. The article states:

New York City has lost almost 176,000 jobs in two years — more than the population of many cities. The unemployment rate, which in the spring of 2001 had fallen to 5.3 percent, has been climbing steadily and jumped to 8.4 percent in December.

Also of note:

New York City has gone through boom and bust before, most recently during what Christine M. Cumming, director of research for the Federal Reserve Bank of New York, described as the long economic winter from 1989 through 1992. The entire region suffered then; Connecticut, New Jersey and New York State lost hundreds of thousands of jobs.

But what has surprised economists this time is that the economic carnage has been concentrated in New York City — and only New York City.

How tough is it?

Those who graduated from business school three or four years ago are in a particularly tough spot, he added. There are very few jobs for people in that category, if they want to stay in New York. He has even had M.B.A.'s apply to be his office manager, a job that pays about $60,000 a year. He said he had received almost 1,300 résumés from applicants.

I'm not surprised. Makes me wonder if I'm making the right choice not having packed up the family and moved somewhere else.

TikiText is a text formatting notation and engine that I've been using myself and have decided to release to the public to vet further. In addition to the Text::Tiki module, this package includes a MovableType plugin that hooks the Tiki module into MT with a text formatting plugin and container tag. It also includes an alpha of a command-line tool for converting TikiText.

Download: .ZIP | .GZ | Reference Guide

UPDATE: More on TikiText here.

UPDATE 2: Version 0.50 is now available here.

Background

Despite the notion of a universal canvas, rich authoring of content through
Web browsers is still rather poor and laborious to do. There have been attempts to create
WYSIWYG editor widgets to rectify this; however, none of these
tools are reliably cross-platform and cross-browser, and they often lack the flexibility of their
read-only counterparts. This is unfortunate and nothing one person will be able to fix any
time soon, leaving us to cope with the brain-dead <textarea> and plain text.

TikiText is an attempt to work with what we have and minimize (not completely solve) these shortcomings.

Recently I was faced with the task of architecting a way for non-developer, non-markup-savvy business
users who were previously using FrontPage to publish writings to the Web. Plain text (with no
formatting) was not going to cut it. Nor was teaching them XHTML. I did an intensive
study of different structured text formatting notations that have been developed in the past.
These notations included a few different Wiki implementations such as UseMod Wiki, MoinMoin Wiki,
and Text::WikiFormat, in addition to Zope's Structured Text, HTML::FromText and Textile.

For one reason or another these notations fell short of my requirements. So in scratching my own itch I developed a notation I call TikiText (Tiki is pronounced tee-kee, not tick-E like the wiki that its name pays some homage to) based on my observations and key learnings. The design goals I defined for TikiText are as follows:

  • Leverage existing text formatting notations.
  • Add the fewest characters possible to plain text.
  • Use more intuitive and common plain text email conventions.
  • Abstract users from needing to know or understand markup whenever possible.
  • Make valid and semantic XHTML markup easy.
  • Easy to learn the basics. Richer functionality for those who want to dive in.

Caveats: While this code is quite usable, it should be used with the understanding that it is
still somewhat experimental and is just being tested and properly documented. Feedback, bug
fixes, and feature implementations are appreciated. Furthermore, I realize this format is less
than perfect and falls short of its design goals. My hope is that it will be refined and tweaked
over time to optimize its effectiveness.

A MovableType plugin that gives access to MT's encode_xml filter for use on a whole block. Download: .ZIP | .GZ

A MovableType plugin that outputs both the entry body and more fields with one tag because you may find a need for it. Download: .ZIP | .GZ

Clay Shirky has published Power Laws, Weblogs, and Inequality discussing the development of a power law distribution in weblogging and garnering extensive discussion and debate in blogging circles.

Shirky concludes:

At the head will be webloggers who join the mainstream media (a phrase which seems to mean media we've gotten used to.) The transformation here is simple - as a blogger's audience grows large, more people read her work than she can possibly read, she can't link to everyone who wants her attention, and she can't answer all her incoming mail or follow up to the comments on her site. The result of these pressures is that she becomes a broadcast outlet, distributing material without participating in conversations about it.

Meanwhile, the long tail of weblogs with few readers will become conversational. In a world where most bloggers get below average traffic, audience size can't be the only metric for success. LiveJournal had this figured out years ago, by assuming that people would be writing for their friends, rather than some impersonal audience. Publishing an essay and having 3 random people read it is a recipe for disappointment, but publishing an account of your Saturday night and having your 3 closest friends read it feels like a conversation, especially if they follow up with their own accounts. LiveJournal has an edge on most other blogging platforms because it can keep far better track of friend and group relationships, but the rise of general blog tools like Trackback may enable this conversational mode for most blogs.

In between blogs-as-mainstream-media and blogs-as-dinner-conversation will be Blogging Classic, blogs published by one or a few people, for a moderately-sized audience, with whom the authors have a relatively engaged relationship. Because of the continuing growth of the weblog world, more blogs in the future will follow this pattern than today. However, these blogs will be in the minority for both traffic (dwarfed by the mainstream media blogs) and overall number of blogs (outnumbered by the conversational blogs.)

Not all are in agreement with Clay's assertions. Dave Winer writes: "The scaling equation for weblogs is, emphatically, not like BBSes, mail lists, not like the Well." (I certainly agree with Dave's call for Clay to set up a weblog.)

Shelley Powers has written an extensive rebuttal to Clay's points noting "he has one failing in regards to his viewpoints as to social gatherings: he's an elitist. He believes there will always be an 'elite' grouping within any society, something I don't necessarily discount; however, from his writing and actions, he also tends to facilitate the mistaken belief that social groupings must follow fixed statistical patterns that support a static elite, and that we must all behave as the statistics dictate. And I say, what a load of hooie." An active conversation follows in the comments to Shelley's post. Clay joins the discussion and says "shame on me for using... old data," but his assertions on the power law curve still stand. Clay notes that was just published with similar views and more updated and statistically relevant data. The numbers taken from the Blogging Ecosystem support Clay's assertions. (Clay has since updated his essay to use the same data.)

Despite the evidence of weblogging following a power curve, Sam Ruby makes one of the most interesting observations thus far when he writes:

I'm listed in the Technorati top 100. By looking at the statistics there, 98.93% of the weblogs it tracks do NOT link to mine. 99.90% of the weblogs tracked have less inbound links than me.

I see no mountains here, only molehills.

The conversation continues.

[Cross posted to my O'Reilly Network Weblog here.]

A Year Ago Yesterday.

| No Comments

Goodness. I missed my one-year anniversary in weblogging.

A year ago yesterday I posted My Love Hate Relationship with Perl, my first weblog entry of any type. (Poor grammar and editing preserved!)

My former employer ran a short-lived experiment with weblogging aimed at existing and potential clients and the press. It was a good thought, but with all of the internal chaos (layoffs, declining revenue and power struggles everywhere) the effort seemed rather frivolous and began languishing almost immediately. In hindsight, it was too structured, too planned and too stiff. It was a valuable learning experience for me. It also exposed me to weblogging and led me to discover MovableType.

That effort pretty much died when most of the staff who wrote to it either quit, were laid off, or feared they were next. Over some drinks with my former colleagues, including a few still employed in the systems administration department, I made the mistake of mentioning it was up and I had the last three entries even though I hadn't worked for them for nearly 6 months. That site disappeared almost immediately. Even The WayBack Machine hadn't recorded it. I've since received permission to republish my entries for posterity and will eventually get them all back up.

Faced with a similar situation of vanishing personal content from the public Internet, Jon Udell reposted his works elsewhere stating, "I took this step reluctantly, and would have preferred that the original namespace remain intact, but so be it. Those columns that have continuing value can now weave themselves back into the fabric of the Web."

I'm not even close to being the writer Jon is, but I hope in some sense the same is true here. This particular piece was an appropriate beginning for me at the time. Later I went on to write some things that I believe are still relevant and became the basis for content I've been covering here -- the beginnings of my leanings towards RESTful Web services in addition to early insights on FlashMX and J2ME. I take great delight in the no-design look of this post now after all of the overzealous creative folks I had to endure in my project work over the years.

Adam Kalsey has announced the availability of his latest MT plugin, SimpleComments. He explains: "Comments and Trackbacks are merged into a single list. Comment counts include the number of TrackBack pings, and best of all, you don’t need to learn new MT tags in order to do this."

This is excellent and exactly what I meant when I wrote TrackBacks are Comments are TrackBacks. Well, almost. Ideally I'd like to see systems such as MT make no differentiation on the backend also. MT is still storing TrackBack pings and comments separately in the database and thereby requiring different (though similar) template tags be utilized. Nevertheless I'll take what progress I can get.

Without any excuses for not following my own advice, I've already implemented the plugin on the TrackBack NG forum page to create a merged chronological list of comments and pings in both HTML and in the syndication feed.
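Conceptually the merge is simple. The Python sketch below is not how SimpleComments or MT does it internally; the record fields and sample data are invented for the example. It treats comments and pings as one kind of response, ordered by time and counted together.

```python
from datetime import datetime

# Hypothetical records; the field names are assumptions, not MT's schema.
comments = [
    {"kind": "comment", "author": "Reader A", "created": datetime(2003, 2, 3, 9, 30)},
    {"kind": "comment", "author": "Reader B", "created": datetime(2003, 2, 5, 14, 0)},
]
pings = [
    {"kind": "ping", "author": "Some Weblog", "created": datetime(2003, 2, 4, 11, 15)},
]

def merged_responses(comments, pings):
    """Combine both lists into a single chronological list of responses."""
    return sorted(comments + pings, key=lambda item: item["created"])

responses = merged_responses(comments, pings)
response_count = len(responses)  # the count includes pings as well as comments
```

If the backend stored both as one type to begin with, the merge step would disappear and a single set of template tags would do.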

Good work, Adam.

About this Archive

This page is an archive of entries from February 2003 listed from newest to oldest.
