[ANN] A Gemini crawler, for statistics about the geminispace

🗣️ From: Luke Emmet (luke (a) marmaladefoo.com)
📅 Sent: 2020-12-18 22:03
📧 Message 8 of 28


>> Could it be possible to show the distribution of page sizes in geminispace?
> Like this (the page was updated)?
>
> * Less than 1 kbyte: 18465 URLs (48.7 %)
> * 1 to 1000 kbytes: 15865 URLs (41.9 %)
> * More than 1000 kbytes: 3559 URLs (9.4 %)

Those bands are very wide.

How about in increments of 10^n? e.g. 1 kB, 10 kB, 100 kB, ...

A general picture across all media types is useful too, but filtering
on text/gemini would be ideal. If you are inclined!
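
The 10^n bands and the text/gemini filter suggested above could be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the record layout (a media type string plus a size in bytes), the function names, and the choice of 1 kB = 1000 bytes are all assumptions.

```python
from collections import Counter


def size_band(size_bytes):
    """Return 0 for sizes under 1 kB, else n where 10^(n-1) kB <= size < 10^n kB.

    Assumes 1 kB = 1000 bytes; integer arithmetic avoids float rounding.
    """
    kb = size_bytes // 1000
    return 0 if kb == 0 else len(str(kb))


def band_label(band):
    """Human-readable label for a band index."""
    if band == 0:
        return "Less than 1 kB"
    return f"{10 ** (band - 1)} to {10 ** band} kB"


def size_distribution(records, media_type="text/gemini"):
    """Count resources per band.

    `records` is a hypothetical iterable of (media_type, size_bytes) pairs.
    Pass media_type=None to count every media type.
    """
    counts = Counter()
    for rec_type, size in records:
        # Drop parameters such as "; charset=utf-8" before comparing.
        base_type = rec_type.split(";")[0].strip().lower()
        if media_type is None or base_type == media_type:
            counts[size_band(size)] += 1
    return counts


def format_distribution(counts):
    """Render the counts as list lines, one per non-empty band."""
    total = sum(counts.values())
    lines = []
    for band in sorted(counts):
        n = counts[band]
        lines.append(f"* {band_label(band)}: {n} URLs ({100 * n / total:.1f} %)")
    return lines
```

Calling `format_distribution(size_distribution(records))` would give the text/gemini-only breakdown, and passing `media_type=None` would give the breakdown over everything.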

> The code is available. For the data, I'm not decided yet. True, it is
> only public data, and there is not even the content of the pages, but
> I don't know yet if there isn't some privacy/ethical problem. Let me
> check.

How about anonymising the data: remove the IP address, domain name,
path and file name, and replace them with anonymous labels, like this:

  - Domain name: "Domain1" ... "DomainN"
  - path: "Path1" ..."PathN"

while still including other details such as resource size, media type,
encoding, etc.

That would still be very useful for statistical analysis in the
aggregate, without revealing any identifiable info?
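
A minimal sketch of that relabelling, assuming each crawl record is a dict with hypothetical keys `ip`, `domain`, `path`, `size`, `media_type` and `encoding` (the real crawler's data layout may well differ):

```python
def anonymise(records):
    """Replace domains and paths with opaque labels and drop the IP address.

    Labels are handed out in order of first appearance. Paths are keyed on
    the (domain, path) pair, so "/" on two hosts gets two different labels;
    that is a design choice made for this sketch.
    """
    domain_labels = {}
    path_labels = {}
    anonymised = []
    for rec in records:
        domain = rec["domain"]
        path_key = (domain, rec["path"])
        # setdefault only stores the new label if the key is not yet known.
        domain_label = domain_labels.setdefault(
            domain, f"Domain{len(domain_labels) + 1}"
        )
        path_label = path_labels.setdefault(
            path_key, f"Path{len(path_labels) + 1}"
        )
        anonymised.append(
            {
                "domain": domain_label,
                "path": path_label,
                # Details kept for aggregate statistics.
                "size": rec["size"],
                "media_type": rec["media_type"],
                "encoding": rec["encoding"],
            }
        )
    return anonymised
```

Because labels follow the order of appearance, shuffling the records before calling `anonymise` would avoid leaking the crawl order. Unusual sizes could still hint at a specific resource, so this is closer to pseudonymisation than to a guarantee.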

  - Luke
