
💾 Archived View for dvejmz.srht.site › 2023-04-21-character-popularity.gmi captured on 2024-08-31 at 11:44:09. Gemini links have been rewritten to link to archived content


I have come up with a quick way to rank the popularity of characters from a popular JRPG franchise using people's replies to a Reddit post, all from the command line.

My approach involves extracting all the comment content from the desired Reddit thread (this one happens to be from the thriving Atelier JRPG community subreddit) and counting up the number of times each character's name is mentioned. Simple, isn't it?

First, I fetched the HTML document of the Reddit post from an alternative, server-side rendered web frontend for Reddit called teddit. This saved me the trouble of using headless Chrome to interact with the scrape-hostile official Reddit website. The official frontend renders most of the page content via JavaScript and obfuscates DOM attributes, making automated data extraction very difficult.

wget -O reddit-answers.html \
    https://teddit.net/r/Atelier/comments/kk3ux6/who_is_your_favourite_atelier_protagonist/

Naturally, fetching the raw comment content from the official Reddit API would be more efficient and elegant, but I really didn't feel like setting up an entire OAuth client and authentication flow for a one-off data extraction job. Moreover, Reddit recently announced they will begin charging for third-party usage of their APIs soon, so neither of these approaches may work that well in the future anyway.

Now we need to scrape all the user comments off the HTML document to process the contents. We can use XPath to achieve that. I used a CLI XPath processor called `xidel` for this. It should be readily available via the package managers of most *nix systems.

xidel \
    --html ./reddit-answers.html \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    > reddit-answers.txt

This command will extract all the text from the DOM elements containing user comments and save it to a plaintext file we can now tokenise.

cat reddit-answers.txt \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]*$//" \
    | sort > reddit-answers-tokenised.txt

This command produces a text file where every word appears on a new line. This makes it easier to identify and count up each word.
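To make the effect of each stage concrete, here is a minimal sketch that wraps the same stages in a shell function and feeds it two made-up comments. The function name `tokenise` and the sample sentences are invented for illustration and are not taken from the actual thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the same tr/sed/sort stages as above, reading stdin.
tokenise() {
    tr -cd "[:alpha:][:space:]-'" |      # drop digits and punctuation, keep - and '
        tr ' [:upper:]' '\n[:lower:]' |  # one word per line, lowercased
        tr -s '\n' |                     # squeeze runs of blank lines
        sed "s/^['-]*//;s/['-]$//" |     # trim leading - or ' and one trailing one
        sort                             # put identical words next to each other
}

# Two made-up comments standing in for reddit-answers.txt.
printf '%s\n' "Ryza is the best!" "I love Firis, and Ryza's story." | tokenise
```

This should print ten lowercase tokens, one per line. Note that `ryza` and `ryza's` come out as separate tokens, so possessive forms are counted separately from the bare name.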

Now that we have the file in a suitable format for counting, it should be plain sailing from here, right? Not quite. At the moment, the file contains *every* word written on the thread, which means 95% of the text is irrelevant to us as we're just interested in the character names. Now I said I wanted to keep things simple here, so rather than trying to somehow identify all the character names, I will discard all the non-character words in the source text using an English dictionary.

To do this, I employed ripgrep, a faster alternative to the text search tool grep, though grep would work just as well. Both tools let you supply a wordlist file containing the terms to search for or exclude in your input. This is a simple text document containing a list of newline-separated words. I will use a simple English dictionary wordlist (https://www-personal.umich.edu/~jlawler/wordlist.html). If you're using Debian or Ubuntu, there may be a dictionary wordlist preinstalled under `/usr/share/dict/`, so you can just reference that file instead. Debian also publish all of their wordlist packages online.

rg -Niwv \
    --regex-size-limit 800M \
    --dfa-size-limit 1G \
    -f ./wordlist reddit-answers-tokenised.txt \
    > reddit-names.txt

This `ripgrep` command performs a case-insensitive (`-i`), whole-word (`-w`), inverted (`-v`) match against the patterns in the English dictionary file `./wordlist`, so only the lines that match none of the dictionary words are written to `reddit-names.txt`. The `-N` flag suppresses line numbers in the output. I also pass in the `--regex-size-limit` and `--dfa-size-limit` flags, which raise ripgrep's limits on the size of the compiled regex and of its DFA; a dictionary-sized pattern list can otherwise exceed the defaults. My settings are not overly scientific here, mind you. I have plenty of memory to spare on my 32GB RAM machine, so I just use large figures to guarantee there will be plenty of headroom.
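The inverted, whole-word matching is easier to see on a tiny sample. The sketch below uses plain `grep`, which accepts the same `-i`, `-w`, `-v` and `-f` flags, so that it runs without ripgrep installed. The function name `drop_dictionary_words`, the sample tokens and the four-word wordlist are all made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the stdin lines that contain no word from the
# wordlist file passed as the first argument.
drop_dictionary_words() {
    # -i ignore case, -w match whole words only, -v invert the match,
    # -f read one pattern per line from the wordlist
    grep -iwv -f "$1"
}

# A made-up four-word "dictionary".
wordlist="$(mktemp)"
printf '%s\n' best is love the > "$wordlist"

# Made-up input that is already tokenised and sorted.
printf '%s\n' best firis is love ryza ryza the | drop_dictionary_words "$wordlist"

rm -f "$wordlist"
```

Only `firis` and the two `ryza` lines should survive, since every other token appears in the sample wordlist.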

At this point, we should have a reasonably clean set of words, with most of the noisy, redundant English terms filtered off. Now we just need to run this final command to count up the number of times each word appears, exclude unpopular words (2 or fewer mentions) from the ranking, and display the results in descending numerical order:

cat ./reddit-names.txt \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '

This is the final output:

16 ryza
13 firis
 8 suelle
 8 lol
 6 meruru
 5 totori
 5 shallie
 4 plachta
 3 rorona
 3 jrpgs
 3 escha

As you can see, it's far from perfect. A few undesired words (such as `lol` and `jrpgs` above) may be counted if your dictionary wordlist is not comprehensive, there is a lot of jargon, or there are many alternative or incorrect spellings in the source text. These should appear infrequently enough in the clean dataset so as not to become an issue.
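The counting and thresholding step can also be tried in isolation on a hand-made sample. The sketch below is a variant under my own assumptions rather than the exact command used above: it swaps the `rg -v '^ *[12] '` filter for an `awk` comparison so that the minimum count becomes a parameter. The function name `rank_words` and the sample tokens are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count words on stdin and keep those mentioned at
# least $1 times (default 3), most mentioned first.
rank_words() {
    local min="${1:-3}"
    sort |           # uniq only merges adjacent duplicates, so sort first
        uniq -c |    # prefix each distinct word with its count
        sort -nr |   # highest count first
        awk -v min="$min" '$1 >= min { print $1, $2 }'
}

# Made-up tokens: ryza x4, firis x3, suelle x2, lol x1.
printf '%s\n' ryza firis ryza suelle ryza lol firis suelle ryza firis | rank_words 3
```

With a threshold of 3, only `ryza` and `firis` should be listed. Passing a different number as the first argument changes the cut-off without touching the regex.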

If you'd like to run your own nerdy rankings on Reddit like me, this script puts it all together so that you can produce rankings with a single command:

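#!/usr/bin/env bash
# Usage: <script> <teddit-thread-url> <wordlist-path>
# Lists every non-dictionary word mentioned 3 or more times, with its count.

if [[ $# -ne 2 ]]; then
    echo "Usage: $0 <teddit-thread-url> <wordlist-path>" >&2
    exit 1
fi

TEDDIT_URL="$1"
WORDLIST_PATH="$2"

# Fetch the thread, extract the comment text, tokenise it, drop dictionary
# words, then count and rank whatever is left.
xidel \
    --html \
    --data="${TEDDIT_URL}" \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]*$//" \
    | sort \
    | rg -Niwv --regex-size-limit 800M --dfa-size-limit 1G -f "${WORDLIST_PATH}" \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '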