Hashtag Index Improvements
2020-01-09 | #crawlers #metadata #hashtags | @Acidus
I think using hashtags in gemtext is pretty cool.
I love @JBanana's #️⃣♊ Hashtags
@JBanana is off to a great start with their Hashtag index of all of Gemini space.
gemini://freeshell.de/hashtags/
However, the index is pretty noisy right now, which makes it hard to discover hashtags that seem intentional, or hashtags that appear on more than one or two pages. Here are some low-hanging-fruit improvements I think can be made.
Filtering out accidental hashtags:
- Some of these hashtags are obviously CSS hex colors, for example #262133. Anything that matches #[0-9a-fA-F]{3} or #[0-9a-fA-F]{6} should probably be ignored.
- Ignore preformatted text blocks. This is a little trickier, but preformatted text is often used in gemtext for things like ASCII art and source code, so it’s more likely to contain accidental hashtags.
- Only index gemtext? I run crawlers on Gemini space and there is a lot of plaintext, source code, ASCII art, random file types with weird extensions, and odd MIME types. This is awesome, but leads again to accidental hashtags. Perhaps only text/gemini documents should be indexed for hashtags.
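Here is a rough sketch of how those three filters could fit together. It is not how the existing index works, just an illustration. The function name extract_hashtags, the hashtag regex, and the choice to lowercase tags are all my own assumptions.

```python
import re

# Assumption: a hashtag is "#" followed by word characters, and the "#"
# must not follow a word character or another "#" (so "##Heading" is skipped).
HASHTAG = re.compile(r"(?<![\w#])#(\w+)")
# Three or six hex digits, i.e. something that looks like a CSS color.
HEX_COLOR = re.compile(r"(?:[0-9a-fA-F]{3}){1,2}")
FENCE = "`" * 3  # gemtext preformatted toggle line prefix


def extract_hashtags(body, mime_type="text/gemini"):
    """Return lowercased hashtags found in a gemtext body (hypothetical helper)."""
    # Filter 3: only index gemtext documents.
    if not mime_type.lower().startswith("text/gemini"):
        return []
    tags = []
    in_preformatted = False
    for line in body.splitlines():
        # Filter 2: skip everything between preformatted toggle lines.
        if line.startswith(FENCE):
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        for match in HASHTAG.finditer(line):
            tag = match.group(1)
            # Filter 1: drop anything that looks like a hex color.
            if HEX_COLOR.fullmatch(tag):
                continue
            tags.append(tag.lower())
    return tags
```

One trade-off to be aware of: the hex color rule also throws away real words made only of hex letters, such as #bad or #decade. That is probably an acceptable loss given how much noise it removes.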
Interface improvements:
The hashtag index itself can be improved to more easily surface compelling content.
- Include the number of occurrences next to a hashtag, so users can easily find popular topics.
- Have a "sorted by occurrence" view in addition to the alphabetical view.
- Consider hiding hashtags with only one occurrence.
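The counting and sorting ideas above could look something like this minimal sketch. I am assuming "occurrences" means the number of distinct pages a hashtag appears on, and that the crawler already produces a mapping of page URL to the tags found on it. The names count_pages and render_index are made up for illustration.

```python
from collections import Counter


def count_pages(pages):
    """pages maps a URL to the list of hashtags found on that page."""
    counts = Counter()
    for tags in pages.values():
        # Use a set so a tag repeated on one page is only counted once.
        counts.update(set(tags))
    return counts


def render_index(counts, by="alpha", min_pages=1):
    """Return index lines like '#docker (2)', optionally hiding rare tags."""
    items = [(tag, n) for tag, n in counts.items() if n >= min_pages]
    if by == "count":
        # Most pages first, ties broken alphabetically.
        items.sort(key=lambda item: (-item[1], item[0]))
    else:
        items.sort()
    return [f"#{tag} ({n})" for tag, n in items]
```

Calling render_index(counts, by="count", min_pages=2) would give the "sorted by occurrence" view with one-off hashtags hidden.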
Really crazy ideas:
- Use a thesaurus to group hashtags of similar meaning together (assuming they are single words).
- Look into stemming words so you can combine similar hashtags into the same hashtag (e.g. #arguing and #argument).
- Search? Finding pages with #docker AND #macos could be helpful.
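The search idea is probably the least crazy of the three. As a sketch, assuming the same URL-to-tags mapping a crawler would produce, an AND search is just a set intersection over an inverted index. The function names and the example.org URLs below are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercased hashtag to the set of page URLs that use it."""
    index = defaultdict(set)
    for url, tags in pages.items():
        for tag in tags:
            index[tag.lower()].add(url)
    return index


def search_all(index, *tags):
    """Return the sorted URLs of pages that contain every given hashtag."""
    if not tags:
        return []
    # Accept "#docker" or "docker"; unknown tags yield an empty set.
    url_sets = [index.get(tag.lower().lstrip("#"), set()) for tag in tags]
    return sorted(set.intersection(*url_sets))


pages = {
    "gemini://example.org/a.gmi": ["docker", "macos"],
    "gemini://example.org/b.gmi": ["docker"],
}
index = build_inverted_index(pages)
print(search_all(index, "#docker", "#macos"))
```

Stemming or thesaurus grouping could be layered on top by normalizing each tag before it goes into the index, but that needs a word list or a stemming library, so I have left it out here.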