Bash script - sitemap with full-text-search

The script below generates a sitemap in my content directory from the *.gmi files that it finds there.

The files are sorted in reverse chronological order of their last-modified time - most recent first. I'm not sure if that's quite the best way to sort it, but it's more useful than filesystem order.
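
As a minimal sketch of that ordering, assuming GNU find (for -printf) and GNU touch (for -d), the following builds a throwaway directory and lists its pages newest first. The helper name list_newest_first and the file names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: list_newest_first and the demo file names are hypothetical.
list_newest_first() {
    # %T@ = last-modified time in seconds since the epoch
    # %P  = path relative to the starting directory "$1"
    find "$1" -name '*.gmi' -printf '%T@ %P\n' | sort -nr | cut -d' ' -f 2-
}

demo_dir=$(mktemp -d)
touch -d '2021-01-01 12:00' "${demo_dir}/old_page.gmi"
touch -d '2022-01-01 12:00' "${demo_dir}/new_page.gmi"

list_newest_first "${demo_dir}"   # new_page.gmi first, then old_page.gmi
rm -r "${demo_dir}"
```

Sorting numerically on the epoch timestamp and then cutting it away keeps the pipeline independent of the locale's date format.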

The link text is generated from the first content line of the file, if it's a heading line. Otherwise, the filename is used, but with underscores replaced with spaces.
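
The rule above can be sketched in isolation like this. It is only an illustration: link_text is a hypothetical helper, and it uses a regular expression to strip the heading marker rather than the parameter expansion used in the real script.

```shell
#!/bin/bash
# Sketch only: link_text is a hypothetical helper, not part of the sitemap script.
link_text() {
    local file=$1 first="" name
    local heading_re='^#+[[:space:]]*(.*)$'
    # Read just the first line of the file
    IFS= read -r first < "${file}"
    if [[ ${first} =~ ${heading_re} ]]; then
        # Heading line: use the text after the leading '#' characters
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        # No heading: fall back to the file name, underscores become spaces
        name=${file%.gmi}
        printf '%s\n' "${name//_/ }"
    fi
}

demo_dir=$(mktemp -d)
printf '# My First Post\nbody text\n' > "${demo_dir}/first.gmi"
printf 'No heading here\n' > "${demo_dir}/my_second_post.gmi"
( cd "${demo_dir}" && link_text first.gmi && link_text my_second_post.gmi )
rm -r "${demo_dir}"
```

The two calls print "My First Post" and "my second post" respectively.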

I also build a full-text index using the Swish++ indexer. This is packaged in, for example, Debian, and can be installed with 'sudo apt-get install swish++'. This gives the index++ and search++ commands to build and query an index. The absolute minimum word-size is, by default, 4 letters unless it's an acronym, in which case it's 3 letters.

I wanted a 3-letter minimum word size (to include 'Vim' in my index!), so I compiled my own Swish++. See the constant Word_Min_Size in src/swishxx-config.h. I also set Word_Min_Vowels to 0, because otherwise words such as 'rsync' are ignored. I simply copied the binaries into my private cgi_assets/ directory and run them from there.

Building was easy enough, but I had to add -lpthread to the link command (I'm not sure why autoconfigure didn't add this). If you use the Debian package, be sure to change the names of the indexer and searcher in the scripts below: they are index++ and search++ in the Debian version, and index and search in the Git version.

Swish++ repository on GitHub

The 'sitemap' script below can run from 'cgi-bin', so it can be invoked via a Gemini client to trigger a rebuild of the sitemap and index.

#!/bin/bash

MY_HOST="gemini.susa.net"
CONTENT_DIR="/home/kevin/gemini/content"
SWISH_INDEXER="../cgi_assets/index"


# Stop here if the content directory is missing, rather than writing elsewhere
cd "${CONTENT_DIR}" || exit 1

{
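    # %T@ is the last-modified time in seconds since the epoch; %P is the
    # path relative to the starting point (no leading './')
    FILES=$(find . -name '*.gmi' \
        -not -name 'index.gmi' -not -name 'sitemap.gmi' \
        -printf '%T@ %P\n' | sort -nr | cut -d' ' -f 2-)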

    echo -e "# Sitemap of ${MY_HOST}\n"
    echo -e "The pages are ordered by date/time of update, most recent first.\n"

    # Note: this relies on word splitting, so file names must not contain spaces
    for f in ${FILES}; do
        t=$(head -n 1 "$f")
        # If we have a heading, then use this for the link text
        if [[ ${t:0:1} == "#" ]]; then
            echo "=> /${f} ${t#\#* }"
        else # use the filename as link text instead
            # Strip the extension for the link text only; the link keeps it
            name=${f%.gmi}
            echo "=> /${f} ${name//_/ }"
        fi
    done
} >sitemap.gmi

# Feed the sitemap files into the Swish++ indexer. I compiled my own version
# and placed it in ../cgi_assets/index. The Debian version is named 'index++'.
# The only reason I compiled my own was to lower the absolute minimum word
# threshold to 3 characters from 4 (e.g. Vim would not be indexed otherwise).

for f in ${FILES}; do echo "./$f"; done | ${SWISH_INDEXER} -

echo -ne "30 /sitemap.gmi\r\n"
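
The final line of the script is the CGI response header. As a rough sketch, a Gemini response header is a two-digit status code, a space, and a meta string, terminated by CRLF. The helper name gemini_header below is hypothetical; printf is used so that the CRLF does not depend on echo's option handling.

```shell
#!/bin/bash
# Sketch only: gemini_header is a hypothetical helper name.
# A Gemini response header is "<status> <meta>" followed by CRLF.
gemini_header() {
    printf '%s %s\r\n' "$1" "$2"
}

gemini_header 30 /sitemap.gmi                   # redirect the client to the sitemap
gemini_header 10 'Please enter a search term'   # ask the client for input
gemini_header 20 text/gemini                    # success; the body follows
```

A script sends exactly one such header, and for status 20 the page content is written after it.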

My full-text search script looks like this:

ftsearch (in cgi-bin)

#!/bin/bash

function urldecode() {
    # Replace-ALL (//) '+' with <space>
    : "${*//+/ }";
    # Replace-ALL (//) '%' with escape-x and evaluate (-e) on echo 
    echo -e "${_//%/\\x}";
}

if [[ "${QUERY_STRING}" == "" ]]; then
    echo -ne "10 Please enter a search term\r\n"
    exit
fi

echo -ne "20 text/gemini\r\n"

# Decode the query, then keep only letters, digits, spaces, '_' and '*'
DECODED=$(urldecode "${QUERY_STRING}" | tr -cd 'A-Za-z0-9 _*')

echo "# Full Text Search results: ${DECODED}"

cd ../../content/ || exit 1
../cgi_assets/search "${DECODED}" | ./cgi-bin/swish2gmi.awk -v "query=$DECODED" sitemap.gmi -
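
To see the decoding and filtering steps on their own, here is a small sketch using the same substitution trick. It uses printf '%b' in place of echo -e, and the helper name sanitise is made up for illustration. The tr character set is written without enclosing square brackets, since tr treats those as literal characters to keep.

```shell
#!/bin/bash
# Sketch only: the same decoding idea as in ftsearch, shown in isolation.
urldecode() {
    local plus_decoded=${1//+/ }             # '+' encodes a space
    printf '%b\n' "${plus_decoded//%/\\x}"   # %2A -> \x2A -> *
}

# Hypothetical helper: keep only letters, digits, space, underscore and '*'
sanitise() {
    tr -cd 'A-Za-z0-9 _*'
}

urldecode 'vim+and+rsync%2A'                 # prints: vim and rsync*
urldecode 'foo%3B+rm+-rf' | sanitise; echo   # prints: foo rm rf
```

Filtering after decoding matters, because characters such as ';' only appear once the percent-escapes have been expanded.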

And here is swish2gmi.awk, which turns the raw search results into Gemini links, using the titles from the sitemap, each followed by up to five matching lines from the file.

#!/usr/bin/awk -f

BEGIN {
    # IGNORECASE is a gawk extension; other awk implementations ignore it
    IGNORECASE = 1

    # Remove and, or, near operators
    gsub(/ (or|and|near) |[*]/, " ", query)
    # Remove not clauses and their operand
    gsub(/ not +[^ ]+/, " ", query)
    # Collapse multiple spaces
    gsub(/[ ]{2,}/, " ", query)
    # Remove leading and trailing spaces
    gsub(/(^[ ]{1,}|[ ]{1,}$)/, "", query)
    # Replace spaces with '|'
    gsub(/ /, "|", query)
}

# Store the sitemap for lookup
/^=> / {
    sitemap[$2] = $0
    next
}

# Swish++ output lines, identified by their integer rank
/^[0-9]+ / {

    key=substr($2, 2)
    if (key in sitemap) {
        print sitemap[key] "\n"
    } else {
        print "=> " key "\n"
    }

    count = 0

    while((getline newline < $2) > 0) {
        if( newline ~ query) {
            print "* " newline
            if(++count == 5) {
                print "(only the first 5 matching lines are shown)"
                break
            }
        }
    }
    close($2)

    print ""

    next
}

{
    # Default handling, only print lines from stdin
    #  e.g. not from sitemap.gmi which is used to build titles
    if(FILENAME == "-")
        print
}

END {
}
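
The BEGIN block turns the search query into an alternation that is then matched against each line of a result file. The sketch below mirrors those five substitutions in the shell, assuming GNU sed (for -E and the case-insensitive I flag). The function name query_to_regex is hypothetical.

```shell
#!/bin/bash
# Sketch only: query_to_regex is a hypothetical name mirroring the BEGIN block.
query_to_regex() {
    printf '%s\n' "$1" | sed -E \
        -e 's/ (or|and|near) |[*]/ /Ig' \
        -e 's/ not +[^ ]+/ /Ig' \
        -e 's/ {2,}/ /g' \
        -e 's/^ +| +$//g' \
        -e 's/ /|/g'
}

query_to_regex 'vim and rsync*'      # prints: vim|rsync
query_to_regex 'gemini not gopher'   # prints: gemini
```

Each remaining word becomes one branch of the pattern, so a line is listed if it contains any of the search terms.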