💾 Archived View for bbs.geminispace.org › u › stack › 4925 captured on 2024-05-10 at 12:40:32. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Re: "Small Cosmos fix: paths in entry URLs are now cleaned up so..."
A quick thought: instead of worrying about duplicate paths, check for _duplicate content_ by hashing it.
Since you already have to read each text (to scan for a referenced link), a fast FNV-1a hash (one xor and one multiply per byte) can stand in for its identity, so duplicates can be eliminated by comparing hashes. Bernstein's djb2 is another option, with a shift and two adds per character.
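A rough sketch of what I mean, in Python. This is not Cosmos code: the function names and the `(url, text)` entry shape are made up for illustration, and it assumes the 64-bit FNV-1a variant and a djb2 truncated to 32 bits.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: xor the byte in, then multiply by the prime (mod 2**64)."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def djb2(data: bytes) -> int:
    """djb2: h = h * 33 + byte, written as a shift and two adds (mod 2**32)."""
    h = 5381
    for byte in data:
        h = (((h << 5) + h) + byte) & 0xFFFFFFFF
    return h


def drop_duplicates(entries):
    """Keep the first entry for each distinct content hash.

    `entries` is a hypothetical iterable of (url, text) pairs.
    """
    seen = set()
    unique = []
    for url, text in entries:
        digest = fnv1a_64(text.encode("utf-8"))
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique
```

Neither hash is collision-free, so a matching hash is only a strong hint; comparing the two texts on a match would rule out a false duplicate.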
Love Cosmos, btw; thank you!
2023-08-30
🕹️ skyjake [OP/mod...] · 2023-08-30 at 13:55:
Thanks for the suggestion. Content hashing has crossed my mind before, and it would indeed automatically eliminate all duplicates, including mirrored domains where the URLs are actually different. Something to try out in the future...
Small Cosmos fix: paths in entry URLs are now cleaned up so that there are no relative references (`.` or `..`). This should remove some duplicate entries. Keep an eye out for weirdly malformed/broken URLs, in case I introduced any new bugs with this...
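For anyone curious what this kind of path cleanup can look like, here is a minimal Python sketch. It is not the actual Cosmos implementation; `remove_dot_segments` and `clean_url` are hypothetical names, and it assumes absolute paths as found in full `gemini://` URLs, roughly following the dot-segment removal described in RFC 3986.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Resolve `.` and `..` segments in an absolute URL path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            # "." refers to the current directory; keep a trailing slash
            if is_last:
                out.append("")
        elif seg == "..":
            # Go up one level, but never above the root (out[0] is "")
            if len(out) > 1:
                out.pop()
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def clean_url(url: str) -> str:
    """Return the URL with dot segments removed from its path only."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))
```

With this sketch, `clean_url("gemini://example.org/users/./alice/../bob/post.gmi")` gives `gemini://example.org/users/bob/post.gmi`, so two entries that differ only in relative references end up with the same URL.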