2011-07-06 Regular Expression To Validate Id Attributes

I noticed that my blog no longer validated as XHTML 1.0 and I started investigating. On the Diary page, you can click on the comment links of the various blog posts (such as *Comments on 2011-07-05 Google Plus*) and you’ll get the comments *inlined*. This uses a tiny piece of JavaScript (and some CSS):

function togglecomments (id) {
   var elem = document.getElementById(id);
   if (elem.className=="commentshown") {
      elem.className="commenthidden";
   }
   else {
      elem.className="commentshown";
   }
}

Thus, the HTML source already includes the comments in an appropriate div:

<div class="commenthidden" id="Comments_on_2011-07-05_Google_Plus">
…
</div>

Links such as *Comments on 2011-07-05 Google Plus* simply call the JavaScript function defined above and pass the *id* of the div to toggle:

<a href="javascript:togglecomments('Comments_on_2011-07-05_Google_Plus')">Comments on 2011-07-05 Google Plus</a>

That’s why the id attribute is important. The trivial solution is to simply use the blog post title (“2011-07-05 Google Plus”), but soon enough you’ll notice that there are some interesting restrictions on the values of id attributes:

may start with a colon, a letter, or an underscore
the rest of the name may contain the above plus dashes, periods, and digits
brackets, braces, and parentheses are not allowed

Now, how exactly is this defined? See the definition of Name in the XML spec:

`NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]`
`NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]`
`Name ::= NameStartChar (NameChar)*`
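As a rough illustration of what the grammar means in practice, here is a small JavaScript sketch that checks a string against it. It assumes a regular expression engine that understands code points (the `u` flag), which is not how the byte-oriented approach on this page works. The names `NAME_START_CHAR`, `NAME_CHAR`, `XML_NAME` and `isXmlName` are made up for this example.

```javascript
// Hypothetical helper names; the ranges are copied from the NameStartChar
// and NameChar productions of the XML spec quoted above.
const NAME_START_CHAR =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}" +
  "\\u{370}-\\u{37D}\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}" +
  "\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}\\u{3001}-\\u{D7FF}" +
  "\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";
const NAME_CHAR =
  NAME_START_CHAR + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// The "u" flag makes the classes work on code points, not UTF-16 units.
// Name ::= NameStartChar (NameChar)*
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

function isXmlName(s) {
  return XML_NAME.test(s);
}

console.log(isXmlName("Comments_on_2011-07-05_Google_Plus")); // true
console.log(isXmlName("2011-07-05 Google Plus"));             // false
```

The second call fails twice over: the title starts with a digit and it contains spaces.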

Now, those are Unicode code points. But sadly, Oddmuse tries to be encoding *agnostic*, so the regular expression has to match encoded bytes rather than code points. (I will have to revisit this decision, soon!)

Here’s a simple beginning of a regular expression that would identify well-formed ASCII names: `/^[:_A-Za-z][-.:_A-Za-z0-9]*$/`

Now to extend it using the information above:

   Unicode Codepoint   UTF-8 encoding
   [#xC0-#xD6]              c3 80 - c3 96
   [#xD8-#xF6]              c3 98 - c3 b6
   [#xF8-#x2FF]             c3 b8 - cb bf
   [#x370-#x37D]            cd b0 - cd bd
   [#x37F-#x1FFF]           cd bf - e1 bf bf
   [#x200C-#x200D]       e2 80 8c - e2 80 8d
   [#x2070-#x218F]       e2 81 b0 - e2 86 8f
   [#x2C00-#x2FEF]       e2 b0 80 - e2 bf af
   [#x3001-#xD7FF]       e3 80 81 - ed 9f bf
   [#xF900-#xFDCF]       ef a4 80 - ef b7 8f
   [#xFDF0-#xFFFD]       ef b7 b0 - ef bf bd
   [#x10000-#xEFFFF]  f0 90 80 80 - f3 af bf bf
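The byte values in the table can be double-checked with a few lines of JavaScript. This is only a sketch of the standard UTF-8 bit layout; the function name `utf8Hex` is made up for this example, and it does no validation (surrogates and values above U+10FFFF are not rejected).

```javascript
// Encode a single Unicode code point as UTF-8 and return the bytes as
// space-separated lowercase hex. No validation of the input is attempted.
function utf8Hex(codePoint) {
  let bytes;
  if (codePoint < 0x80) {
    bytes = [codePoint];                       // 0xxxxxxx
  } else if (codePoint < 0x800) {
    bytes = [
      0xc0 | (codePoint >> 6),                 // 110xxxxx
      0x80 | (codePoint & 0x3f),               // 10xxxxxx
    ];
  } else if (codePoint < 0x10000) {
    bytes = [
      0xe0 | (codePoint >> 12),                // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  } else {
    bytes = [
      0xf0 | (codePoint >> 18),                // 11110xxx
      0x80 | ((codePoint >> 12) & 0x3f),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex(0xc0));    // c3 80
console.log(utf8Hex(0x2ff));   // cb bf
console.log(utf8Hex(0xd7ff));  // ed 9f bf
console.log(utf8Hex(0xeffff)); // f3 af bf bf
```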

I started writing the following regular expression:

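# Alternatives for multi-byte UTF-8 sequences; incomplete, covering only the
# table rows up to [#x200C-#x200D]. Trailing bytes are limited to \x80-\xbf.
$regexp = "|\xc3[\x80-\x96\x98-\xb6\xb8-\xbf]|[\xc4-\xca][\x80-\xbf]|\xcb[\x80-\xbf]"
        . "|\xcd[\xb0-\xbd\xbf]|[\xce-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]"
        . "|\xe1[\x80-\xbe][\x80-\xbf]|\xe1\xbf[\x80-\xbf]|\xe2\x80[\x8c\x8d]"
    if $HttpCharset eq 'UTF-8';
# Group the alternatives so that ^ anchors all of them, not just the first.
$id = ":$id" unless $id =~ /^(?:[:_A-Za-z]$regexp)/;
# Keep only the characters (or byte sequences) allowed in a name.
return join('', $id =~ m/([-.:_A-Za-z0-9]$regexp)/g);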
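For comparison, the same clean-up idea (prefix a colon if the first character is not allowed, then drop everything that is not a name character) might look like this in JavaScript if you can work on code points instead of bytes. This is a sketch under that assumption, not what Oddmuse does; `toId`, `inRanges` and the range tables are hypothetical names, with the ranges taken from the XML Name grammar.

```javascript
// [low, high] code point ranges from NameStartChar in the XML spec.
const NAME_START_RANGES = [
  [0x3a, 0x3a], [0x41, 0x5a], [0x5f, 0x5f], [0x61, 0x7a],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f], [0x2c00, 0x2fef],
  [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd], [0x10000, 0xeffff],
];
// Additional ranges allowed by NameChar: "-" and ".", digits, #xB7, etc.
const NAME_EXTRA_RANGES = [
  [0x2d, 0x2e], [0x30, 0x39], [0xb7, 0xb7], [0x300, 0x36f], [0x203f, 0x2040],
];

const inRanges = (cp, ranges) =>
  ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
const isNameStart = (cp) => inRanges(cp, NAME_START_RANGES);
const isNameChar = (cp) => isNameStart(cp) || inRanges(cp, NAME_EXTRA_RANGES);

function toId(title) {
  // Spreading a string splits it by code point, not by UTF-16 unit.
  const chars = [...title];
  const startsWell = chars.length > 0 && isNameStart(chars[0].codePointAt(0));
  // Prefix ":" when the first character is not a valid name start.
  const candidate = startsWell ? chars : [":", ...chars];
  // Silently drop every character that may not appear in a name.
  return candidate.filter((ch) => isNameChar(ch.codePointAt(0))).join("");
}

console.log(toId("Comments_on_2011-07-05_Google_Plus"));
// Comments_on_2011-07-05_Google_Plus
console.log(toId("2011-07-05 Google Plus"));
// :2011-07-05GooglePlus
```

Note that dropping characters means two different titles can end up with the same id.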

Then I got tired and thought, “if anybody reports an error, I’ll add the rest…”

#Web #XML #Oddmuse

Comments

(Please contact me if you want to remove your comment.)

⁂

You do know RegEx match open tags except XHTML self-contained tags?

– Harald 2011-07-12 12:00 UTC

---

Yes, I’ve seen that before. And do you know Oh Yes You Can Use Regexes to Parse HTML!?

In my case, however, this is not about parsing HTML but about transforming wiki page titles into id values that I can then use in the generated HTML.

– Alex Schroeder 2011-07-12 12:15 UTC
