I noticed that my blog no longer validated as XHTML 1.0 and I started investigating. On the Diary page, you can click on the comment links of the various blog posts (such as *Comments on 2011-07-05 Google Plus*) and you’ll get the comments *inlined*. This uses a tiny piece of JavaScript (and some CSS):
function togglecomments(id) {
  var elem = document.getElementById(id);
  if (elem.className == "commentshown") {
    elem.className = "commenthidden";
  } else {
    elem.className = "commentshown";
  }
}
Thus, the HTML source already includes the comments in an appropriate div:
<div class="commenthidden" id="Comments_on_2011-07-05_Google_Plus"> … </div>
Links such as *Comments on 2011-07-05 Google Plus* simply call the JavaScript function defined above and pass the *id* of the div to toggle:
<a href="javascript:togglecomments('Comments_on_2011-07-05_Google_Plus')">Comments on 2011-07-05 Google Plus</a>
That’s why the id attribute is important. The trivial solution is to simply use the blog post title (“2011-07-05 Google Plus”), but soon enough you’ll notice that there are some interesting restrictions on the values of id attributes: in XHTML, id is of type ID, and the XML spec requires values of that type to match the Name production.
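Now, how exactly is this defined? See the definition of Name in the XML spec:

`NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]`

`NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]`

`Name ::= NameStartChar (NameChar)*`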
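Now, those are Unicode code points. But sadly, Oddmuse tries to be encoding *agnostic*, so page names reach this code as raw bytes rather than decoded characters, and the ranges have to be matched as UTF-8 byte sequences. (I will have to revisit this decision soon!)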
Here’s a simple start: a regular expression that identifies well-formed names as long as they are plain ASCII: `/^[:_A-Za-z][-.:_A-Za-z0-9]*$/`
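To see what the ASCII-only expression accepts and rejects, here is a minimal sketch in JavaScript. The helper name `isAsciiName` is made up for this illustration and is not part of Oddmuse; the pattern is anchored at both ends so that the whole string is checked.

```javascript
// ASCII-only subset of the XML Name production, anchored at both ends.
const asciiName = /^[:_A-Za-z][-.:_A-Za-z0-9]*$/;

// isAsciiName is a hypothetical helper, not part of Oddmuse.
function isAsciiName(id) {
  return asciiName.test(id);
}

console.log(isAsciiName("Comments_on_2011-07-05_Google_Plus")); // true
console.log(isAsciiName("2011-07-05 Google Plus")); // false: leading digit, spaces
console.log(isAsciiName("Zürich")); // false: ü is outside ASCII
```

The last case is the reason the expression needs extending: ü is a perfectly legal name character, it just isn't ASCII.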
Now to extend it using the information above:
| Unicode code points | UTF-8 encoding |
|---|---|
| [#xC0-#xD6] | c3 80 - c3 96 |
| [#xD8-#xF6] | c3 98 - c3 b6 |
| [#xF8-#x2FF] | c3 b8 - cb bf |
| [#x370-#x37D] | cd b0 - cd bd |
| [#x37F-#x1FFF] | cd bf - e1 bf bf |
| [#x200C-#x200D] | e2 80 8c - e2 80 8d |
| [#x2070-#x218F] | e2 81 b0 - e2 86 8f |
| [#x2C00-#x2FEF] | e2 b0 80 - e2 bf af |
| [#x3001-#xD7FF] | e3 80 81 - ed 9f bf |
| [#xF900-#xFDCF] | ef a4 80 - ef b7 8f |
| [#xFDF0-#xFFFD] | ef b7 b0 - ef bf bd |
| [#x10000-#xEFFFF] | f0 90 80 80 - f3 af bf bf |
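The byte sequences in this table can be double-checked with a few lines of code. The following is a rough sketch in JavaScript; `utf8Hex` is a name invented for this example, and it assumes the input is a valid code point outside the surrogate range.

```javascript
// Encode one Unicode code point as UTF-8 and return the bytes as hex.
// utf8Hex is a hypothetical helper; surrogates (#xD800-#xDFFF) are not handled.
function utf8Hex(cp) {
  let bytes;
  if (cp < 0x80) {
    bytes = [cp]; // one byte: 0xxxxxxx
  } else if (cp < 0x800) {
    bytes = [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // two bytes
  } else if (cp < 0x10000) {
    bytes = [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  } else {
    bytes = [
      0xf0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3f),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    ];
  }
  return bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex(0xc0));    // c3 80
console.log(utf8Hex(0x2ff));   // cb bf
console.log(utf8Hex(0x1fff));  // e1 bf bf
console.log(utf8Hex(0xeffff)); // f3 af bf bf
```

Note that a single code point range can span encodings of different lengths (for example #x37F-#x1FFF goes from two bytes to three), which is what makes the byte-level regular expression tedious to write.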
I started writing the following regular expression:
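# Multi-byte NameStartChar ranges as UTF-8 byte patterns (unfinished: stops at #x200D).
$regexp = "|\xc3[\x80-\x96\x98-\xb6\xb8-\xff]|[\xc4-\xca].|\xcb[\x00-\xbf]"
  . "|\xcd[\xb0-\xbd\xbf-\xff]|[\xce-\xDF].|\xe0..|\xe1[\x00-\xbe]."
  . "|\xe1\xbf[\x00-\xbf]|\xe2\x80[\x8c\x8d]"
  if $HttpCharset eq 'UTF-8';
# Group the alternatives so that ^ anchors every one of them, not just the first.
$id = ":$id" unless $id =~ /^(?:[:_A-Za-z]$regexp)/;
# Keep only the characters that may appear in a Name.
return join('', $id =~ m/([-.:_A-Za-z0-9]$regexp)/g);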
Then I got tired and thought, “if anybody reports an error, I’ll add the rest…”
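For comparison, here is a sketch of how the whole thing might look if the text were decoded first and matched by code point instead of by byte. This is not what Oddmuse does: it is JavaScript rather than Perl, it assumes the title is already a decoded string, and `toId` is a name made up for this illustration. It follows the same two steps as the Perl code (prefix a colon if the first character is not a valid start, then drop every character that is not allowed), with the ranges taken from the Name production in the XML spec.

```javascript
// NameStartChar from the XML spec, written as code point ranges.
const START = ":_A-Za-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}" +
  "\\u{370}-\\u{37D}\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}" +
  "\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}\\u{3001}-\\u{D7FF}" +
  "\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";
// NameChar adds "-", ".", digits, #xB7 and two ranges of combining marks.
const REST = START + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// The "u" flag makes the regular expressions work on code points.
const startsWell = new RegExp(`^[${START}]`, "u");
const allowed = new RegExp(`[${REST}]`, "gu");

// toId is a hypothetical name; it mirrors the two steps of the Perl code.
function toId(title) {
  const id = startsWell.test(title) ? title : ":" + title;
  return (id.match(allowed) || []).join("");
}

console.log(toId("Comments_on_2011-07-05_Google_Plus")); // unchanged
console.log(toId("2011-07-05 Google Plus")); // :2011-07-05GooglePlus
```

Working on code points keeps the ranges readable and identical to the spec, at the price of having to know the encoding of the input.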
#Web #XML #Oddmuse
⁂
You do know RegEx match open tags except XHTML self-contained tags?
– Harald 2011-07-12 12:00 UTC
---
Yes, I’ve seen that before. And do you know Oh Yes You Can Use Regexes to Parse HTML!?
In my case, though, it’s not about parsing HTML but about transforming wiki page titles into id values that I can then use in the generated HTML.
– Alex Schroeder 2011-07-12 12:15 UTC