  ____                _            _     _          _   _ _____ __  __ _     
 / ___| ___ _ __ ___ | |_ _____  _| |_  | |_ ___   | | | |_   _|  \/  | |    
| |  _ / _ \ '_ ` _ \| __/ _ \ \/ / __| | __/ _ \  | |_| | | | | |\/| | |    
| |_| |  __/ | | | | | ||  __/>  <| |_  | || (_) | |  _  | | | | |  | | |___ 
 \____|\___|_| |_| |_|\__\___/_/\_\\__|  \__\___/  |_| |_| |_| |_|  |_|_____|
                                                                             
                               _            
  ___ ___  _ ____   _____ _ __| |_ ___ _ __ 
 / __/ _ \| '_ \ \ / / _ \ '__| __/ _ \ '__|
| (_| (_) | | | \ V /  __/ |  | ||  __/ |   
 \___\___/|_| |_|\_/ \___|_|   \__\___|_|   

Gemtext to HTML converter

A text/gemini document to text/html (Gemtext to HTML) converter. The idea is to create an HTML document whose source is easy to read, and that follows the specifications[1][2] of a Gemtext document as closely as possible, but in HTML. A simple default CSS stylesheet is embedded if one isn't supplied.

[1] Gemini specification (gemtext)

[2] Gemini specification (HTML)

The idea is to follow the simple line-by-line basis of the gemtext spec, and reflect that within the HTML code as well. The upshot of this is that, for instance, the "Unordered list items":

/^\* / && preformat_toggle == false {
  sub(/^\* /, "")
  print body_padding "<ul><li>" escape_html($0) "</li></ul>"
  next
}

Are actually represented within HTML, not as grouped together under one HTML unordered list element, but with each individual list item contained within its own unordered list element. The only state linked to the parsing of the gemtext document is within the preformatted text sections, which toggle on and off. As can be seen in the "Unordered list items" section, it only interprets the list items when not within a preformatted text section.
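
To make the per-line behaviour concrete, the sketch below feeds two list items through a cut-down copy of the rule above. It assumes an empty "body_padding", carries its own minimal copy of "escape_html", and drops the "preformat_toggle" check. The shell function name "list_demo" is made up for this illustration and is not part of gmi2html.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
list_demo() {
  awk '
    # Minimal copy of the escape_html helper described later.
    function escape_html(s) {
      gsub(/&/, "\\&amp;", s)
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
    }
    # Same pattern as the rule above, minus the preformat check.
    /^\* / {
      sub(/^\* /, "")
      print "<ul><li>" escape_html($0) "</li></ul>"
      next
    }
  '
}

printf '%s\n' '* first item' '* second & last' | list_demo
```

Each input line produces its own complete "<ul><li>...</li></ul>" element, so two list items give two separate lists in the HTML source.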

The preformatting toggle lines (code blocks, preformatted text sections) are the only lines that require state. More specifically, they toggle preformatted text output on and off. The toggle is also checked when a line matches the opening and closing delimiter for preformatted text (three backticks "```"), so that the code knows whether the line is opening a preformatted text section or closing one.

/^```/ && preformat_toggle == false {
  preformat_toggle = true
  preformat_start = true

  sub(/^```[ \t]*/, "")
  sub(/[ \t]+$/,"")
  if ($0 != "") {
    preformat_title = $0
    if (preformat_title == "TOC" && TOC == true) {
      print_toc(TABLE_OF_CONTENTS)
    }
  }
  
  next
}

/^```/ && preformat_toggle == true {
  preformat_toggle = false

  if (preformat_start == true) { 
    preformat_start = false
  } else {
    print "</pre>"
    if (preformat_title != "") {
      print body_padding "  <figcaption id=\"" create_id(preformat_title) "\">"
      print body_padding "    " escape_attribute(preformat_title)
      print body_padding "  </figcaption>"
      print body_padding "</figure>"
    }
  }

  preformat_title = ""

  next
}

A couple of things to note: if there are no lines between the start and end of a preformatted text section, then nothing is printed. The toggle lines themselves, starting with three backticks ("```"), are never output as lines, per the spec.
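
The toggle behaviour can be sketched on its own as below. This is a simplified illustration, not the converter's code: it uses 0/1 flags ("in_pre", "first") instead of the "true"/"false" strings, ignores titles and HTML escaping, and the name "pre_demo" is made up. The delimiter is built from character code 96 so that the sketch itself contains no line of three backticks.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
pre_demo() {
  awk '
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }   # three backticks
    index($0, fence) == 1 {
      if (!in_pre) {
        in_pre = 1; first = 1                 # opening toggle line
      } else {
        if (!first) print "</pre>"            # only close a block that was opened
        in_pre = 0; first = 0
      }
      next                                    # toggle lines are never printed
    }
    in_pre {
      if (first) { print "<pre>"; first = 0 } # open lazily on first content line
      print
      next
    }
    { print "<p>" $0 "</p>" }
  '
}

# \140 is the octal code for a backtick in the printf format string.
printf 'before\n\140\140\140\n\140\140\140\nafter\n' | pre_demo
```

With the empty block in the usage line, only the two paragraphs are printed. A block with at least one content line is wrapped in "<pre>" and "</pre>".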

The other notable part is that the opening toggle line can have a text section after it. The code trims any spaces from the beginning and end of the text, and uses that as the HTML title for the preformatted section. There is a special case where the text on the toggle line equals "TOC" and the "TOC" parameter passed in on the command line is "true". In this case a table of contents, created in the setup section, is printed at this point. To aid accessibility in the HTML output, the preformatted text section is wrapped in a "figure" tag with a figure caption if the text section is included[1]. The ARIA role of "figure" is used[2].

[1] Pre tag accessibility

[2] ARIA figure role

The "print_toc" helper function loops over the passed in table of contents array in order. It first works out what level the current heading is, based on the number of "#"s at the beginning of the heading, either 3, 2 or 1. The heading is then trimmed of all its "#"s, and preceding and trailing spaces. The ID is then created, followed by the indentation prefix. The HTML fragment link is then printed out.

A blank line is printed at the end of the TOC. This is so that a preformatted "TOC" section (the label that gets a TOC printed) can be placed right up against the text that follows it. If the text is converted with no TOC flagged, there won't be an extra blank line in the output. If the converted text is flagged to have a TOC added, the TOC prints the blank line itself.

function print_toc(toc, _level, _heading, _id, _indent, _size, _i) {
  _size = toc["size"]
  for (_i = 1; _i <= _size; _i++) {
    _heading = toc[_i]

    if (_heading ~ /^###/) {
      _level = 3
    } else if (_heading ~ /^##[ \t]*/) {
      _level = 2
    } else if (_heading ~ /^#[ \t]*/) {
      _level = 1
    }

    _heading = create_heading(_heading)
    _id = create_id(_heading)

    _indent = ""
    if (_level == 1) {
      _indent = "→ "
    } else if (_level == 2) {
      _indent = "→ → "
    } else if (_level == 3) {
      _indent = "→ → → "
    }
    print body_padding "<p><a href=\"#" _id "\">" escape_html(_indent _heading) "</a></p>"
  }
  print body_padding "<p><br></p>"
}

The heading is created by removing up to the first three leading "#"s, and then any leading and trailing spaces.

function create_heading(string) {
    sub(/^(###|##|#)[ \t]*/,"",string)
    sub(/^[ \t]+/,"",string)
    sub(/[ \t]+$/,"",string)
    
    return string
}

The "create_id" helper function creates an ID from the heading text. It first trims all spaces from the start and end of the heading text. After that all remaining spaces are replaced with dashes ("-"), all characters that are not dashes or alphanumeric are removed, and the result is forced to lowercase. Something like "This is a title *yep really*" would become "this-is-a-title-yep-really".

function create_id(string) {
    sub(/^[ \t]+/,"",string)
    sub(/[ \t]+$/,"",string)

    gsub(/[ \t]/,"-",string)
    gsub(/[^0-9a-zA-Z\-]/,"",string)
    return tolower(string)
}
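
As a quick check of those steps, the sketch below wraps a copy of "create_id" in a shell function so it can be tried on arbitrary text. The wrapper name "create_id_demo" is made up for this illustration, and the character class is written with a trailing "-" rather than "\-", which should be equivalent.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
create_id_demo() {
  awk -v text="$1" '
    function create_id(string) {
      sub(/^[ \t]+/, "", string)           # trim leading spaces
      sub(/[ \t]+$/, "", string)           # trim trailing spaces
      gsub(/[ \t]/, "-", string)           # remaining spaces become dashes
      gsub(/[^0-9a-zA-Z-]/, "", string)    # drop everything else
      return tolower(string)
    }
    BEGIN { print create_id(text) }
  '
}

create_id_demo "  This is a title *yep really*  "   # this-is-a-title-yep-really
create_id_demo "C++ & Rust"                         # c--rust
```

Note that two headings with the same text produce the same ID, so fragment links can only reach the first of them.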

When the closing toggle line is detected, either nothing is printed (because there were no lines between the open and closing toggle lines), or the closing HTML "</pre>" tag is printed.

The other line types, as laid out in the gemtext spec, are simpler, like the "Unordered list items", as they have no state, and only apply on a per line basis.

Quote lines just take a line and enclose it in the HTML "<blockquote>" tag. This allows a long quote to be wrapped, and behaves like a quoted paragraph.

/^>/ && preformat_toggle == false {
  sub(/^>/, "")
  print body_padding "<blockquote>" escape_html($0) "</blockquote>"
  next
}

Heading lines are matched in reverse order, most specific first ("###", then "##", then "#"), to simplify the code and regexes. Most of the work is done in the "print_heading" function. Like the other simple line types, these don't get activated if they are in a preformatted section.

/^###/ && preformat_toggle == false {
  print_heading("h3", $0)
  next
}

/^##[ \t]*/ && preformat_toggle == false {
  print_heading("h2", $0)
  next
}

/^#[ \t]*/ && preformat_toggle == false {
  print_heading("h1", $0)
  next
}
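
The effect of that ordering can be seen in the reduced sketch below, which only prints the heading type each line would get. It leaves out the "preformat_toggle" check and "print_heading", and the name "heading_demo" is made up for this illustration.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
heading_demo() {
  awk '
    /^###/ { print "h3"; next }   # most specific pattern first
    /^##/  { print "h2"; next }
    /^#/   { print "h1"; next }
    { print "text" }
  '
}

printf '%s\n' '# One' '## Two' '### Three' '#### Four' 'plain' | heading_demo
```

Because each rule ends with "next", a line is claimed by the first pattern it matches. A line starting with four "#"s is therefore treated as an "h3", as gemtext only defines three heading levels.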

The "print_heading" helper function has two main sections. The first section creates the heading using the helper function, and then an ID from the heading text using another helper function which is incorporated into an HTML "id" attribute.

The creation of the ID is controlled by the "TOC" parameter, if it is passed with the value of "true" on the command line, it will trigger an ID creation and a link below each heading to jump to the top of the document. The idea being that the headings can then be linked to via HTML fragment links, perhaps from an included table of contents (TOC) of links at the top of the page.

-v TOC=true

The second section then prints out an HTML heading defined by what heading type was passed in via the "type" variable e.g. "h1", "h2" or "h3".

function print_heading(type, heading, _id) {
  heading = create_heading(heading)
  
  if (TOC == true) {
    _id = " id=\"" create_id(heading) "\""
  } else {
    _id = ""
  }

  if (TOC == true) {
    print body_padding "<" type _id "><a href=\"#\">" escape_html(heading) "</a></" type  ">"
  } else {
    print body_padding "<" type _id ">" escape_html(heading) "</" type  ">"
  }
}

The table of contents is created in the setup stage, and then printed when a preformatted section titled "TOC" is found. It essentially gets passed in the file that is being converted, so that it can loop through and find any headings that aren't in a preformatted section, and store them in the passed in table of contents array.

function create_toc(file, toc, _preformat_toggle, _toc_count) {
  _preformat_toggle = false
  _toc_count = 0

  while ((getline <file) > 0) {
    if ($0 ~ /^```/) {
      if (_preformat_toggle == false) {
        _preformat_toggle = true
      } else {
        _preformat_toggle = false
      }
    } else if ($0 ~ /^(###|##|#)[ \t]*/ && _preformat_toggle == false) {
      toc[++_toc_count] = $0
    }
  }
  toc["size"] = _toc_count
  close(file)
}
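
A stripped-down version of that scan is sketched below. It differs from "create_toc" in a few ways that are assumptions of this example: it reads into a variable called "line" instead of "$0", prints the headings rather than storing them in an array, uses a 0/1 flag, and builds the delimiter from character code 96. The name "toc_demo" and the temporary sample file are made up.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
toc_demo() {
  awk -v file="$1" '
    BEGIN {
      fence = sprintf("%c%c%c", 96, 96, 96)      # three backticks
      while ((getline line < file) > 0) {        # stops at end of file or on error
        if (index(line, fence) == 1) {
          in_pre = !in_pre                       # flip on every toggle line
        } else if (line ~ /^#/ && !in_pre) {
          print line                             # heading outside a preformatted block
        }
      }
      close(file)
    }
  '
}

sample=$(mktemp)
printf '# Title\n\140\140\140\n# not a heading\n\140\140\140\n## Section\n' > "$sample"
toc_demo "$sample"
rm -f "$sample"
```

The "# not a heading" line sits inside a preformatted block, so only "# Title" and "## Section" are reported.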

Link lines just create an HTML link from the supplied URL and comment. It trims away the link chars "=>" used to denote a link line, and any spaces before the URL. The link and comment are then split matching on any spaces after the URL. If the "INLINE" flag has been set to true from the command line:

-v INLINE=true

Then the URL is checked to see if it is an image. If so, the link is turned into an image tag instead of a link, and the comment becomes the image's alt text. Also, if the "url" and the "link_name" are the same, a custom HTML data attribute called "data-noprint" is added, so that the CSS print media rule knows not to append the URL to those links, as they already describe themselves in the text.

/^=>[ \t]*/ && preformat_toggle == false {
  sub(/^=>[ \t]*/, "")
  url = ""
  link_name = ""
  if (match($0,/[ \t]+/)) {
    url = substr($0,1,RSTART-1)  # AWK strings are indexed from 1
    link_name = substr($0,RSTART+RLENGTH)
    data_attribute = "" 
  } else {
    url = $0
    link_name = $0
    data_attribute = "data-noprint "
  }
  if (INLINE == true && is_image(url) == true) {
    print body_padding "<p><img src=\"" url "\" alt=\"" escape_attribute(link_name) "\"></p>"
  } else {
    print body_padding "<p><a " data_attribute "href=\"" url "\">" escape_html(link_name) "</a></p>"
  }
  next
}
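
The splitting step on its own can be sketched as follows. The output format (URL and name separated by "|") and the name "link_demo" are made up for this illustration. The sketch starts "substr" at index 1, since AWK strings are indexed from 1 and a start index of 0 is not handled the same way by every AWK implementation.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
link_demo() {
  awk '
    /^=>[ \t]*/ {
      sub(/^=>[ \t]*/, "")                     # drop the marker and leading spaces
      if (match($0, /[ \t]+/)) {               # first run of spaces ends the URL
        url = substr($0, 1, RSTART - 1)
        name = substr($0, RSTART + RLENGTH)
      } else {                                 # no comment: the URL names itself
        url = $0
        name = $0
      }
      print url "|" name
    }
  '
}

printf '%s\n' '=> gemini://example.org/ An example capsule' '=>gemini://example.org/' | link_demo
```

The first line splits into the URL and "An example capsule". The second has no comment, so the URL is used for both parts.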

The check image helper function first trims the spaces from the start and end of the URL, then takes the last four characters of the URL as the suffix, forcing them to lowercase at the same time. The suffix is then checked against ".png" and ".jpg". If it matches, the function returns true, otherwise false.

function is_image(url, _suffix) {
  sub(/^[ \t]+/,"",url)
  sub(/[ \t]+$/,"",url)
  
  _suffix = tolower(substr(url,length(url)-3))
  if (_suffix == ".png" || _suffix == ".jpg") {
    return true
  } else {
    return false
  }
}
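
A few sample URLs show what the suffix check accepts. The sketch below wraps a copy of "is_image" that returns the strings "true" and "false" directly, and the wrapper name "is_image_demo" is made up for this illustration.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
is_image_demo() {
  awk -v url="$1" '
    function is_image(url, _suffix) {
      sub(/^[ \t]+/, "", url)
      sub(/[ \t]+$/, "", url)
      _suffix = tolower(substr(url, length(url) - 3))   # last four characters
      return (_suffix == ".png" || _suffix == ".jpg") ? "true" : "false"
    }
    BEGIN { print is_image(url) }
  '
}

is_image_demo "photo.PNG"            # true
is_image_demo "photo.jpeg"           # false
is_image_demo "pic.jpg?size=large"   # false
```

Because only the last four characters are compared, a ".jpeg" suffix or a URL with a query string after the filename is treated as an ordinary link.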

The last line type is the text line type, and is the general one, as in, if a line doesn't match any of the other line types, it defaults to the text line type. The text line type has to pay attention to whether it is within a preformatted section or not, as it has to handle those cases slightly differently, hence the two sections of the if statement:

{
  if (preformat_toggle == true) {
    if (preformat_start == true) {
      preformat_start = false
      if (preformat_title == "") {
        print body_padding "<pre>"
      } else {
        print body_padding "<figure role=\"figure\" aria-labelledby=\"" create_id(preformat_title) "\">"
        print body_padding "  <pre>"
      }
    }
    print escape_html($0)
  } else {
    if ($0 ~ /^[ \t]*$/) {
      print body_padding "<p><br></p>"
    } else {
      print body_padding "<p>" escape_html($0) "</p>"
    }
  }
}

The first section deals with lines within a preformatted text section, and has a special case for the first line within a preformatted section, as it has to print the opening tag of the enclosing HTML "<pre>" block, possibly including a title. Each of the lines is HTML escaped, to make sure that the HTML is output correctly.

The second section deals with any other lines, and treats them all as text, and outputs each line as a paragraph, so that long lines wrap correctly. Blank lines need to be handled slightly differently, which is why there is a blank line check. This is because the blank line would just be consumed by the HTML paragraph tags, and although the HTML source would show the paragraph, the browser wouldn't show a blank line.

The program structure

Header
Begin section
Embedded CSS
Escape HTML
Escape attributes
basename
Check image
Print heading
Create heading
Create ID
Create table of contents
Print table of contents
Link lines
Preformatting toggle lines
Heading lines
Unordered list items
Quote lines
Text lines
End section

Setup

The beginning section just sets up some control variables and constants. It also prints out the header of the HTML, including the embedded CSS. Two special constants are created to represent "true" and "false". They are essentially just the strings of those words aliased to a variable, so the variable can then be used like a keyword in expressions.

Defaults are set for all the parameters, so that if no parameters are passed it is like calling the command with:

gmi2html -v INLINE=true -v LANG=en -v TOC=false file.gmi > file.html

The "LANG" part detects if a different language is specified as a parameter on the command line. If none is found it defaults to English (en). The language and charset are set on the HTML doc, again following the gemtext specification, and creating the HTML doc in a similar way. The charset is set to "utf-8" following the gemtext spec.

The "TOC" part detects if a table of contents is required, by checking the "TOC" parameter passed in on the command line. If a table of contents is required, the "create_toc" helper function is called with the file being converted to HTML, and an array to store the headings in. The helper function runs over the file and collects all the headings, ready for "print_toc" to print them out.

The name of the gemtext file that is being converted to HTML is used as the HTML page title, by just grabbing the filename via the "basename" function. "ARGV[1]" is the file passed into awk to run the script over.

BEGIN {
  true = "true"
  false = "false"
  preformat_toggle = false
  preformat_start = false
  preformat_title = ""
  body_padding = "    "

  if (LANG == "") {
      LANG = "en"
  }

  if (TOC == "") {
    TOC = false
  }

  if (INLINE == "") {
    INLINE = true
  }

  delete TABLE_OF_CONTENTS

  if (TOC == true) {
    create_toc(ARGV[1], TABLE_OF_CONTENTS)
  }

  print "<!DOCTYPE html>"
  print "<html lang=\"" LANG "\">"
  print "  <head>"
  print "    <meta charset=\"utf-8\">"
  print "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
  print "    <title>" basename(ARGV[1]) "</title>"
  print "    <style>"
  print embedded_css(CSS)
  print "    </style>"
  print "  </head>"
  print "  <body>"
}

Embedded CSS

So that there is a nice default look to the HTML output, some simple CSS has been added. This also gives someone who wants to create some custom CSS a starting point. The default embedded CSS can be overridden by passing the "CSS" parameter with the filename of the custom CSS file to be included:

-v CSS=file.css

The custom CSS file will be copied in, rather than linked to, within the final HTML output.

function embedded_css(css_file, _embedded_css) {
  _embedded_css = "\
      html {\n\
        font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;\n\
        font-weight: 400;\n\
        font-size: 16px;\n\
        -ms-text-size-adjust: 100%;\n\
        -webkit-text-size-adjust: 100%;\n\
        margin: 16px;\n\
      }\n\
\n\
      body {\n\
        line-height: 1.5rem;\n\
        max-width: 64em;\n\
        margin: auto;\n\
      }\n\
\n\
      h1, h2, h3, p, figure, blockquote, ul, li, a {\n\
        /* Don't break out */\n\
        overflow-wrap: break-word;\n\
        word-wrap: break-word;\n\
        margin-block-start: 0px;\n\
        margin-block-end: 0px;\n\
      }\n\
\n\
      h1, h2, h3 {\n\
        line-height: 1.5em;\n\
        font-weight: bold;\n\
      }\n\
\n\
      h1 {\n\
        font-size: 3em;\n\
      }\n\
\n\
      h2 {\n\
        font-size: 2em;\n\
      }\n\
\n\
      h3 {\n\
        font-size: 1.5em;\n\
      }\n\
\n\
      pre {\n\
        overflow-x: scroll;\n\
        background-color: rgb(245,245,245);\n\
        padding-left: 1em;\n\
        padding-right: 1em;\n\
        padding-top: 0.5em;\n\
        padding-bottom: 0.5em;\n\
        margin-top: 0px;\n\
        margin-bottom: 0px;\n\
        line-height: 1.3em;\n\
        font-family: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace;\n\
      }\n\
\n\
      figure {\n\
        margin-inline-start: 0px;\n\
        margin-inline-end: 0px;\n\
      }\n\
\n\
      figcaption {\n\
        font-size: 0.75em;\n\
      }\n\
\n\
      blockquote:before, blockquote:after {\n\
        content: \"\";\n\
      }\n\
\n\
      blockquote {\n\
        border-left: 6px solid #ccc;\n\
        padding-left: 0.5em;\n\
        margin-top: 0px;\n\
        margin-bottom: 0px;\n\
        margin-left: 0.5em;\n\
      }\n\
\n\
      li {\n\
        padding-top: 0px;\n\
      }\n\
\n\
      a {\n\
        color: #33adff;\n\
        outline: 0 none;\n\
        text-decoration: none;\n\
      }\n\
\n\
      a:not([href^=\"#\"])::before {\n\
        content: \"\\2192\\00a0\";\n\
      }\n\
\n\
      img {\n\
        max-width: 100%;\n\
      }\n\
\n\
      /* Mobile view */\n\
      @media (max-width: 600px) {\n\
        body {\n\
          margin: 8px;\n\
        }\n\
      }\n\
\n\
      @media print {\n\
        a:not([href^=\"#\"]):not([data-noprint])::after {\n\
          content: \" (\" attr(href) \") \";\n\
        }\n\
      }"

  if (css_file != "") {
    _embedded_css = ""
    while ((getline <css_file) > 0) {
      if (_embedded_css == "") {
        _embedded_css = _embedded_css "      " $0
      } else {
        _embedded_css = _embedded_css "\n      " $0 
      }           
    }
    close(css_file)
  }

  return _embedded_css
}

The other helper functions

This function just escapes the string to allow it to include special HTML characters.

function escape_html(s) {
  gsub(/&/,"\\&amp;",s)
  gsub(/</,"\\&lt;",s)
  gsub(/>/,"\\&gt;",s)
  return s
}
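
The order of the substitutions matters: the ampersand has to be replaced first, otherwise the "&" in the freshly inserted "&lt;" and "&gt;" would be escaped a second time. A small check, with the made-up wrapper name "escape_html_demo":

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
escape_html_demo() {
  awk '
    function escape_html(s) {
      gsub(/&/, "\\&amp;", s)   # first, so later entities are left alone
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
    }
    { print escape_html($0) }
  '
}

printf '%s\n' 'if (a < b && b > c)' | escape_html_demo
# prints: if (a &lt; b &amp;&amp; b &gt; c)
```

The double backslash in the replacement strings is needed because a bare "&" in a "gsub" replacement stands for the matched text.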

Attributes which need to be included, mostly for preformatted text sections and images, need to be HTML escaped. Also, as attributes are enclosed in double quotes within the outputted HTML, any double quotes need to be escaped too.

function escape_attribute(s) {
  s = escape_html(s)
  gsub(/"/,"\\&quot;",s)
  return s
}
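
AWK passes strings to functions by value, so the result of "escape_html" only takes effect if it is assigned back to the variable. The sketch below shows the attribute escaping with that assignment in place. The wrapper name "escape_attribute_demo" is made up for this illustration.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
escape_attribute_demo() {
  awk -v text="$1" '
    function escape_html(s) {
      gsub(/&/, "\\&amp;", s)
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
    }
    function escape_attribute(s) {
      s = escape_html(s)          # keep the escaped copy
      gsub(/"/, "\\&quot;", s)    # then protect the attribute quotes
      return s
    }
    BEGIN { print escape_attribute(text) }
  '
}

escape_attribute_demo 'Tom & "Jerry" <3'
# prints: Tom &amp; &quot;Jerry&quot; &lt;3
```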

This function is used to get the basename of a file, i.e. just the filename minus the filetype suffix and the preceding directories, e.g. the basename for "/usr/local/test.txt" would be "test".

function basename(file, _a1, _a2, _n1) {
  _n1 = split(file, _a1, "/")
  split(_a1[_n1], _a2, ".")
  return _a2[1]
}
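
The function can be tried on its own with the sketch below. The wrapper name "basename_demo" and the second sample path are made up for this illustration.

```shell
# Hypothetical demo wrapper, not part of gmi2html itself.
basename_demo() {
  awk -v file="$1" '
    function basename(file, _a1, _a2, _n1) {
      _n1 = split(file, _a1, "/")   # last element is the filename
      split(_a1[_n1], _a2, ".")     # first element is the part before any dot
      return _a2[1]
    }
    BEGIN { print basename(file) }
  '
}

basename_demo "/usr/local/test.txt"    # test
basename_demo "notes/archive.tar.gz"   # archive
```

Everything after the first dot is dropped, so a file with several suffixes keeps only its first part as the page title.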

The end section just prints the closing HTML tags from the header section, completing the HTML document.

END {
  print "  </body>"
  print "</html>"
}

The Header

The header is just the bit of the file that tells the shell what to run it with. It goes at the very start of the file.

#!/usr/bin/awk -f

How to use

Call like:

gmi2html file.gmi > file.html

or, if not on a *nix system but AWK is available:

awk -f gmi2html file.gmi > file.html

You can also call passing in custom CSS to be included instead of the default one:

gmi2html -v CSS={path to CSS file} file.gmi > file.html

e.g.

gmi2html -v CSS=file.css file.gmi > file.html

You can also specify the language type of the document (defaults to "en"):

gmi2html -v LANG=en file.gmi > file.html

You can also specify that a URI fragment table of contents (TOC) be created and can be used for navigation i.e. it will create id attributes for each of the headings so links can be used to jump to them. The TOC will replace a preformatted section titled with "TOC", this can be blank or you can add a manual TOC which will be swapped out with the HTML link version e.g.

 ```TOC
 ```

or

 ```TOC
 Start of the doc
 What is this all about
   Section 1
   Section 2
 The end of the doc
 ```

If no "TOC" section is found, no table of contents will be printed, but the HTML id attributes will still be created, as will the links to jump to the top of the document. The "TOC" parameter defaults to "false":

gmi2html -v TOC=true file.gmi > file.html

You can also specify that links to ".png" and ".jpg" images are turned into inline images (defaults to "true"):

gmi2html -v INLINE=true file.gmi > file.html

You can combine all, or none of the above e.g. create TOC links and inline images, and English as a language:

gmi2html -v INLINE=true -v LANG=en -v TOC=true file.gmi > file.html

Though because of the defaults this can just be:

gmi2html -v TOC=true file.gmi > file.html

The order of the parameters doesn't matter as long as they follow the parameter format of:

-v {PARAMETER}={VALUE}