Playing a SAX

There's a project that might start up at The Company involving lots of XML (eXtensible Markup Language) and C programming, so I've been poking around libxml [1]. I'm thinking I might even want to use this for mod_blog to validate HTML (HyperText Markup Language) (since libxml has an HTML parser, and about a quarter of the time I blow the coding on an entry and have to fix it).

One problem that crops up is the difficulty of getting at errors while libxml is reading the document into memory. Sure, I can suck the HTML in with one call:

>
```
htmlDocPtr doc = htmlParseFile(filename,NULL);
```

(yes, it is that simple). But not seeing how to change the underlying error reporting mechanism (not that I looked all that hard), I decided to switch to the SAX (Simple API for XML) interface for parsing. The SAX interface lets you register functions that get called at various points during the HTML (or even XML) parsing. Yes, I can grab the errors as they happen, but now I have to build the document in memory myself (more or less). But that's okay, since in theory this will let me not only capture the errors, but filter the HTML as I see fit.

Two things popped right out at me.

First, the callback when a tag is found:

>
```
void startElement(void *user_data,
                  const xmlChar *name,
                  const xmlChar **attrs);
void endElement(void *user_data,
                const xmlChar *name);
```
In these callbacks, the name parameter is the name of the element. The attrs parameter contains the attributes for the start tag. The even indicies in the array will be attribute names, the odd indicies are the values, and the final index will contain a NULL.

“Using the SAX Interface of LibXML [2]” (a tutorial)

Okay, seems simple enough. I write some code:

```
static void start_tag(void *data,const xmlChar *name,const xmlChar **attr)
{
  int i;

  /*--------------------------------------
  ; similar to printf() but functionally
  ; a bit better.
  ;
  ; And yes, this is how I format comments
  ; in C.
  ;--------------------------------------*/

  LineSFormat(StdoutStream,"$","<%a",name);

  for (i = 0 ; attr[i] != NULL ; i += 2)
  {
    LineSFormat(StdoutStream,"$ $"," %a=\"%b\"",attr[i],attr[i+1]);
  }
}
```

And the first time this code runs it crashes.

It seems that the documentation is a bit misleading: attr is only valid if there are attributes. Otherwise a NULL is passed in, which means you have to explicitly check attr for NULL!

Aaaaah!

Would it have been that difficult for the authors of libxml to always pass in a valid attr, even if it's just a two-element array with both entries NULL? (I suppose most programmers would check anyway just because, and the bloat continues.)
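For what it's worth, here is a minimal sketch of one way to cope: treat a NULL attr as an empty list before walking it. This is not the mod_blog code. It uses snprintf() instead of LineSFormat(), format_start_tag() and no_attrs are names made up for illustration, and the xmlChar typedef is only a stand-in so the sketch compiles without the libxml headers. The check for a NULL attribute value is a further assumption, in case an HTML attribute with no value shows up that way.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for libxml's xmlChar (an unsigned char) so this compiles
   on its own; remove it when including the real libxml headers. */
typedef unsigned char xmlChar;

/* Hypothetical helper: render a start tag into buf (size must be > 0),
   treating a NULL attribute array the same as an empty one.
   Returns buf for convenience. */
static char *format_start_tag(char *buf, size_t size,
                              const xmlChar *name, const xmlChar **attr)
{
  static const xmlChar *no_attrs[] = { NULL };
  size_t used;
  int    i;

  /* The one check the crash was missing. */
  if (attr == NULL)
    attr = no_attrs;

  snprintf(buf, size, "<%s", (const char *)name);

  /* Even indices are names, odd indices are values. */
  for (i = 0; attr[i] != NULL; i += 2)
  {
    used = strlen(buf);
    if (attr[i + 1] != NULL)
      snprintf(buf + used, size - used, " %s=\"%s\"",
               (const char *)attr[i], (const char *)attr[i + 1]);
    else /* assumed case: attribute given without a value */
      snprintf(buf + used, size - used, " %s", (const char *)attr[i]);
  }

  used = strlen(buf);
  snprintf(buf + used, size - used, ">");
  return buf;
}
```

A real startElement callback could do the same substitution on its first line and leave the rest of the loop exactly as it was.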

The second thing. Catching the errors. Yeah. The callbacks for those?

>
```
void sax_error(void *data,const char *msg, ... );
```

The errors (and warnings, and fatal errors) are passed back as a printf() style message.

So forget about intelligently handling the errors unless you want to parse the actual error messages.
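About the best a callback of that shape can do is expand the message with vsnprintf() and stash the text somewhere. A rough sketch, assuming the data argument is a pointer the caller handed to the parser (the struct error_log type and its fields are invented for this example):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical place to collect what the parser reports. */
struct error_log
{
  int  count;      /* how many messages have arrived  */
  char last[256];  /* text of the most recent message */
};

/* Same shape as the callback above: a printf()-style format
   string followed by its arguments. */
static void sax_error(void *data, const char *msg, ...)
{
  struct error_log *errs = data;
  va_list           args;

  va_start(args, msg);
  vsnprintf(errs->last, sizeof errs->last, msg, args);
  va_end(args);

  errs->count++;
}
```

This only captures the formatted text. Telling one kind of error from another still means picking that text apart, which is exactly the complaint.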

Aaaaaaaarg!

[1] http://xmlsoft.org/

[2] http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#start-end-element
