The script below generates a sitemap in my content directory based on the *.gmi files that it finds there.
The files are sorted in reverse chronological order of their last-modified time - most recent first. I'm not sure if that's quite the best way to sort it, but it's more useful than filesystem order.
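The ordering step can be tried on its own. This is a minimal sketch, assuming GNU find (for -printf) and GNU touch (for -d); the function name list_newest_first and the demo file names are made up for illustration. It uses %T@, the modification time, which is what the text describes; GNU find's %C@ is the status-change time instead.

```shell
#!/bin/bash
# Hypothetical helper: list *.gmi files under a directory, newest first.
# %T@ is the modification time in seconds since the epoch,
# %P is the path relative to the starting directory.
list_newest_first() {
    find "$1" -name '*.gmi' -printf '%T@ %P\n' | sort -nr | cut -d' ' -f 2-
}

# Throwaway demo data with explicit modification times
demo=$(mktemp -d)
touch -d '2021-01-01 00:00:00' "$demo/old.gmi"
touch -d '2022-01-01 00:00:00' "$demo/new.gmi"
touch -d '2021-06-01 00:00:00' "$demo/notes.txt"   # ignored: not *.gmi
list_newest_first "$demo"                          # new.gmi, then old.gmi
rm -r "$demo"
```

Sorting numerically on the epoch timestamp avoids any date parsing, and cut then discards the timestamp column so only the paths remain.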
The link text is generated from the first content line of the file, if it's a heading line. Otherwise, the filename is used, but with underscores replaced with spaces.
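The link-text rule is easy to check in isolation. The sketch below is not part of the sitemap script: link_text is a hypothetical helper that takes the first line and the file name as arguments instead of reading a file, but it uses the same bash parameter expansions.

```shell
#!/bin/bash
# Hypothetical helper: derive link text from a first line and a file name.
link_text() {
    local first_line=$1 path=$2
    if [[ ${first_line:0:1} == "#" ]]; then
        # Heading line: strip the leading '#' characters and the space after them
        printf '%s\n' "${first_line#\#* }"
    else
        # No heading: drop the .gmi suffix and turn underscores into spaces
        local name=${path%.gmi}
        printf '%s\n' "${name//_/ }"
    fi
}

link_text "## Sitemap script" "sitemap_script.gmi"   # prints: Sitemap script
link_text "Plain first line" "my_notes_page.gmi"     # prints: my notes page
```

Note that a heading with no space after the '#' is left untouched by the prefix pattern, so it would appear in the link text with its '#' intact.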
I also build a full-text index using the Swish++ indexer. This is packaged in, for example, Debian, and can be installed with 'sudo apt-get install swish++'. This gives the index++ and search++ commands to build and query an index. The absolute minimum word-size is, by default, 4 letters unless it's an acronym, in which case it's 3 letters.
I wanted a 3 letter minimum word size (to include 'Vim' in my index!), so I compiled my own Swish++. See const Word_Min_Size in src/swishxx-config.h. I also set Word_Min_Vowels to 0 (otherwise 'rsync', for example, is ignored). I simply copied the binaries into my private cgi_assets/ directory and run them from there.
Building was easy enough, but I had to add -lpthread to the link command (not sure why autoconfigure didn't add this). If you use the Debian package, be sure to change the names of the indexer and searcher in the scripts below: they are index++ and search++ in the Debian version, and index and search in the Git version.
The 'sitemap' script below can run from 'cgi-bin', so it can be invoked via a Gemini client to trigger a rebuild of the sitemap and index.
```
#!/bin/bash

MY_HOST="gemini.susa.net"
CONTENT_DIR="/home/kevin/gemini/content"
SWISH_INDEXER="../cgi_assets/index"

cd "${CONTENT_DIR}" || exit 1

{
    # %T@ is the last-modified time in seconds since the epoch, %P the path
    # relative to the starting directory. Sort newest first, then drop the time.
    FILES=$(find . -name '*.gmi' \
        -not -name 'index.gmi' -not -name 'sitemap.gmi' \
        -printf '%T@ %P\n' | sort -nr | cut -d' ' -f 2-)

    echo -e "# Sitemap of ${MY_HOST}\n"
    echo -e "The pages are ordered by date/time of update, most recent first.\n"

    # The unquoted ${FILES} is split on whitespace, so file names
    # containing spaces are not supported.
    for f in ${FILES}; do
        t=$(head -1 "$f")
        # If we have a header, then use this for the link text
        if [[ ${t:0:1} == "#" ]]; then
            printf '=> /%s %s\n' "$f" "${t#\#* }"
        else
            # Use the filename as link text instead; the link target
            # keeps its .gmi extension
            name=${f%.gmi}
            printf '=> /%s %s\n' "$f" "${name//_/ }"
        fi
    done
} >sitemap.gmi

# Feed the sitemap files into the Swish++ indexer. I compiled my own version
# and placed it in ../cgi_assets/index. The Debian version is named 'index++'.
# The only reason I compiled my own was to lower the absolute minimum word
# threshold to 3 characters from 4 (e.g. Vim would not be indexed otherwise).
for f in ${FILES}; do echo "./$f"; done | ${SWISH_INDEXER} -

# CGI response: redirect the client to the freshly built sitemap
echo -ne "30 /sitemap.gmi\r\n"
```
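The final line of the script is its CGI response. A Gemini response header is a two-digit status code, a space, a meta string and CRLF: 10 asks the client for input, 20 announces success along with a MIME type, and 30 is a redirect. A small sketch of that convention follows; gemini_header is a hypothetical helper name, not something the scripts on this page define.

```shell
#!/bin/bash
# Hypothetical helper: emit a Gemini response header "<status> <meta>\r\n"
gemini_header() {
    printf '%s %s\r\n' "$1" "$2"
}

gemini_header 30 /sitemap.gmi                   # redirect to the rebuilt sitemap
gemini_header 10 'Please enter a search term'   # prompt the client for input
gemini_header 20 text/gemini                    # success; the body follows
```

Using printf rather than echo -ne keeps the CRLF handling in one place and avoids echo's option parsing.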
My full-text search script looks like this:
ftsearch (in cgi-bin)
```
#!/bin/bash

function urldecode() {
    # Replace-ALL (//) '+' with <space>
    : "${*//+/ }";
    # Replace-ALL (//) '%' with escape-x and evaluate (-e) on echo
    echo -e "${_//%/\\x}";
}

if [[ "${QUERY_STRING}" == "" ]]; then
    echo -ne "10 Please enter a search term\r\n"
    exit
fi

echo -ne "20 text/gemini\r\n"

# Keep only letters, digits, space, underscore and '*'. The set is written
# without surrounding brackets, because tr would treat them as literal
# characters to keep.
DECODED=$(urldecode "${QUERY_STRING}" | tr -cd 'A-Za-z0-9 _*')

echo "# Full Text Search results: ${DECODED}"

cd ../../content/ || exit 1

../cgi_assets/search "${DECODED}" | ./cgi-bin/swish2gmi.awk -v "query=$DECODED" sitemap.gmi -
```
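The urldecode function relies on two bash details: ':' is a no-op, but bash stores its last expanded argument in $_, and echo -e expands \xHH escapes. Here it is on its own, together with a whitelist filter in the same spirit as the script's; the sample query strings are made up.

```shell
#!/bin/bash
# Same approach as urldecode in ftsearch, shown in isolation.
urldecode() {
    # ':' does nothing, but bash stores its last expanded argument in $_
    : "${*//+/ }"
    # Turn each '%' into '\x' so that echo -e expands %HH as a hex escape
    echo -e "${_//%/\\x}"
}

urldecode 'vim+%2A'          # prints: vim *
urldecode 'rsync%20backup'   # prints: rsync backup

# Anything outside the whitelist is dropped before the query is used
clean=$(urldecode 'rm%3B+-rf' | tr -cd 'A-Za-z0-9 _*')
echo "$clean"                # prints: rm rf
```

The filter matters because the decoded string is passed to another program; deleting everything except a small known-safe set is simpler than trying to escape the dangerous characters.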
And swish2gmi.awk, which converts the search results into more useful information.
```
#!/usr/bin/awk -f

BEGIN {
    # IGNORECASE is a gawk extension; other awks silently ignore it
    IGNORECASE = 1
    # Remove and, or, near operators
    gsub(/ (or|and|near) |[*]/, " ", query)
    # Remove not clauses and their operand
    gsub(/ not +[^ ]+/, " ", query)
    # Collapse multiple spaces
    gsub(/[ ]{2,}/, " ", query)
    # Remove leading and trailing spaces
    gsub(/(^[ ]{1,}|[ ]{1,}$)/, "", query)
    # Replace spaces with '|'
    gsub(/ /, "|", query)
}

# Store the sitemap for lookup
/^=> / {
    sitemap[$2] = $0
    next
}

# Swish++ output lines, identified by their integer rank
/^[0-9]+ / {
    # $2 is the file path, e.g. ./page.gmi; drop the leading '.'
    key = substr($2, 2)
    if (key in sitemap) {
        print sitemap[key] "\n"
    } else {
        print "=> " key "\n"
    }
    # Show up to 5 lines of the file that match the query
    count = 0
    while ((getline newline < $2) > 0) {
        if (newline ~ query) {
            print "* " newline
            if (++count == 5) {
                print "More than 5 lines match"
                break
            }
        }
    }
    close($2)
    print ""
    next
}

{
    # Default handling, only print lines from stdin
    # e.g. not from sitemap.gmi which is used to build titles
    if (FILENAME == "-")
        print
}

END { }
```
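The BEGIN block turns a Swish++ query into a regular expression of alternatives, which is then used to pick out matching lines from each result file. A sketch of just that step, wrapped in a hypothetical shell function normalise_query so it can be run directly. It writes the space-collapsing and trimming patterns with '+' rather than interval expressions, an assumption made so that it also runs on awks without {n,} support.

```shell
#!/bin/bash
# Hypothetical wrapper: run only the query-normalising steps of swish2gmi.awk
normalise_query() {
    awk -v "query=$1" 'BEGIN {
        gsub(/ (or|and|near) |[*]/, " ", query)   # drop operators and wildcards
        gsub(/ not +[^ ]+/, " ", query)           # drop "not" clauses with their operand
        gsub(/ +/, " ", query)                    # collapse runs of spaces
        gsub(/^ +| +$/, "", query)                # trim both ends
        gsub(/ /, "|", query)                     # remaining words become alternatives
        print query
    }'
}

normalise_query 'vim and rsync not emacs'   # prints: vim|rsync
normalise_query 'gemini* or gopher'         # prints: gemini|gopher
```

Dropping the "not" operand entirely means excluded words never show up as highlighted matches, which is the point of removing the clause rather than just the operator.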