💾 Archived View for gemi.dev › gemini-mailing-list › 000541.gmi captured on 2024-08-31 at 17:29:09. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[ANN] A Gemini crawler, for statistics about the geminispace

1. Stephane Bortzmeyer (stephane (a) sources.org)

I'm running a Gemini crawler, which gathers metadata about the
geminispace. The goal is not to make a search engine but to survey the
geminispace.

You can find the current results (the crawler did not crawl the entire
space yet):

gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

The reference site:

gemini://gemini.bortzmeyer.org/software/lupa/

The source code, with an issue tracker (bug reports and improvement
requests are very welcome):

https://framagit.org/bortzmeyer/lupa

2. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> I'm running a Gemini crawler, which gathers metadata about the
> geminispace. The goal is not to make a search engine but to survey the
> geminispace.
> 
> You can find the current results (the crawler did not crawl the entire
> space yet):
> 
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi
> 
> The reference site:
> 
> gemini://gemini.bortzmeyer.org/software/lupa/
> 
> The source code, with an issue tracker (bug reports and improvement
> requests are very welcome):
> 
> https://framagit.org/bortzmeyer/lupa

  Very cool!  Thanks for the work.

  One stat I haven't seen yet (yours or from GUS) is a breakdown by
language: how many pages had a lang parameter, then a breakdown by
language, and how many list multiple languages in one parameter (for
example, "lang=en,fr").

  -spc
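
A rough sketch of how such a breakdown could be computed from the META
field of success responses. The function name and the sample data are
made up for illustration; this is not how Lupa or GUS necessarily do it.

```python
from collections import Counter


def lang_breakdown(metas):
    """Count lang usage across META strings such as 'text/gemini; lang=en,fr'."""
    with_lang = 0    # responses carrying a non-empty lang parameter
    multi_lang = 0   # responses listing more than one language
    per_language = Counter()
    for meta in metas:
        # Everything after the first ';' is a list of key=value parameters.
        params = [p.strip() for p in meta.split(";")[1:]]
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() != "lang":
                continue
            raw = value.strip().strip('"')
            tags = [t.strip().lower() for t in raw.split(",") if t.strip()]
            if not tags:
                continue
            with_lang += 1
            if len(tags) > 1:
                multi_lang += 1
            per_language.update(tags)
            break  # illustrative choice: only the first lang parameter counts
    return with_lang, multi_lang, per_language


# Hypothetical META values, not real crawl data.
sample = [
    "text/gemini; lang=en",
    "text/gemini; lang=en,fr",
    "text/gemini",
    "text/gemini; charset=utf-8; lang=fr",
]
with_lang, multi_lang, per_language = lang_breakdown(sample)
print(with_lang, multi_lang, dict(per_language))
```

Counting each tag of a multi-language parameter separately is one
possible convention; counting the combination "en,fr" as its own
category would be another.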

3. Stephane Bortzmeyer (stephane (a) sources.org)

On Wed, Dec 16, 2020 at 06:16:53PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 27 lines which said:

> > You can find the current results (the crawler did not crawl the entire
> > space yet):
> > 
> > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

>   One stat I haven't seen yet (yours or from GUS) is a breakdown of
> language.  How many pages had a lang parameter, then a breakdown by
> language, how many multiple languages per parameters (for example,
> "lang=en,fr").

Just ask :-) Now done:

gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

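I note:

* French is the second language after English. Cocorico, as we say in
France.

* There is one page in Finnish.

* There are more HTML than Markdown pages on the geminispace, which I
find surprising.

* There is one page in EBCDIC and one in CP-437 :-)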


4. Luke Emmet (luke (a) marmaladefoo.com)


On 16-Dec-2020 15:05, Stephane Bortzmeyer wrote:
> I'm running a Gemini crawler, which gathers metadata about the
> geminispace. The goal is not to make a search engine but to survey the
> geminispace.
>
> You can find the current results (the crawler did not crawl the entire
> space yet):
>
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi
(Due to user error, my reply was supposed to go to the list, but was 
sent privately, so I'm re-posting)

This is very interesting, thank you.

Could it be possible to show the distribution of page sizes in 
geminispace? I know you show the average page size, but to get a better 
view of what is typical and the range would be good. For example does it 
follow a power law etc...

Is there any raw data available?

  - Luke

5. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 16, 2020, at 16:05, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> I'm running a Gemini crawler, which gathers metadata about the
> geminispace.

Along those lines, a couple of one-liners to gather various host & content information:


# IP address(es)
# dig +short mozz.us
174.138.124.169


# geolocation
# curl --silent https://tools.keycdn.com/geo.json?host=174.138.124.169 | jq | gron
json = {};
json.data = {};
json.data.geo = {};
json.data.geo.asn = 14061;
json.data.geo.city = "North Bergen";
json.data.geo.continent_code = "NA";
json.data.geo.continent_name = "North America";
json.data.geo.country_code = "US";
json.data.geo.country_name = "United States";
json.data.geo.datetime = "2020-12-18 09:04:57";
json.data.geo.host = "174.138.124.169";
json.data.geo.ip = "174.138.124.169";
json.data.geo.isp = "DIGITALOCEAN-ASN";
json.data.geo.latitude = 40.793;
json.data.geo.longitude = -74.0247;
json.data.geo.metro_code = 501;
json.data.geo.postal_code = "07047";
json.data.geo.rdns = "174.138.124.169";
json.data.geo.region_code = "NJ";
json.data.geo.region_name = "New Jersey";
json.data.geo.timezone = "America/New_York";
json.description = "Data successfully received.";
json.status = "success";


# certificate info
# cfssl certinfo -domain mozz.us | jq | gron
json = {};
json.authority_key_id = "A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1";
json.issuer = {};
json.issuer.common_name = "Let's Encrypt Authority X3";
json.issuer.country = "US";
json.issuer.names = [];
json.issuer.names[0] = "US";
json.issuer.names[1] = "Let's Encrypt";
json.issuer.names[2] = "Let's Encrypt Authority X3";
json.issuer.organization = "Let's Encrypt";
json.not_after = "2021-01-21T01:36:54Z";
json.not_before = "2020-10-23T01:36:54Z";
json.pem = "-----BEGIN 
CERTIFICATE-----\nMIIGJzCCBQ+gAwIBAgISBAK7/ku/XjgmczVT7mmM1cEcMA0GCSqGSIb3D
QEBCwUA\nMEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQD\n
ExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0yMDEwMjMwMTM2NTRaFw0y\nMTAxMjEwM
TM2NTRaMBIxEDAOBgNVBAMTB21venoudXMwggEiMA0GCSqGSIb3DQEB\nAQUAA4IBDwAwggEKAo
IBAQDZ4pi5q0QlIxAo8sKNBgInG1BGH584lRghCdnrBsZD\n68IuFlJ3V3wrnfsaNv8IZOHRkvx
N2uxDo/oVxCCSNug/Ne4b+Pqw7U8thB9zL46A\nMbrHVtAmloykToDRlOHv/OLp2YRQiW7cD57l
xot+9+TPlHsAuMccQXQDMbmhT6bf\nirO4m6F6gRf478YLLVOmpxkLd87dhHa7gO3NwmRroIB/D
MLdQRAVAMbdDGTjdCrA\nlToWeHOnPNBLKPmI6M9DCqEXoTbIa9OhpJmo+txlS85O8/RHzXu2fV
kgnEnBIcsE\n/ZEh5ytov1SogIXzNQgIJFesaWCqgBPLun4molEnfcq5AgMBAAGjggM9MIIDOTA
O\nBgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMAwG\nA1UdEw
EB/wQCMAAwHQYDVR0OBBYEFI3x/VWfHHCG1IfE32kGHZPG4RC6MB8GA1Ud\nIwQYMBaAFKhKamM
Efd265tE5t6ZFZe/zqOyhMG8GCCsGAQUFBwEBBGMwYTAuBggr\nBgEFBQcwAYYiaHR0cDovL29j
c3AuaW50LXgzLmxldHNlbmNyeXB0Lm9yZzAvBggr\nBgEFBQcwAoYjaHR0cDovL2NlcnQuaW50L
XgzLmxldHNlbmNyeXB0Lm9yZy8wgfIG\nA1UdEQSB6jCB54ILYXBpLm1venoudXOCE2FzdHJvYm
90YW55Lm1venoudXOCDGNo\nYXQubW96ei51c4ILZGV2Lm1venoudXOCDmdlbWluaS5tb3p6LnV
zggtnaXQubW96\nei51c4IRZ29vZHZpYmVzLm1venoudXOCDmdvcGhlci5tb3p6LnVzghRtYWls
LWFy\nY2hpdmUubW96ei51c4IMbWFpbC5tb3p6LnVzgg9taWNoYWVsLm1venoudXOCB21v\neno
udXOCDnBvcnRhbC5tb3p6LnVzgg1wcm94eS5tb3p6LnVzggt3d3cubW96ei51\nczBMBgNVHSAE
RTBDMAgGBmeBDAECATA3BgsrBgEEAYLfEwEBATAoMCYGCCsGAQUF\nBwIBFhpodHRwOi8vY3BzL
mxldHNlbmNyeXB0Lm9yZzCCAQQGCisGAQQB1nkCBAIE\ngfUEgfIA8AB3AJQgvB6O1Y1siHMfgo
siLA3R2k1ebE+UPWHbTi9YTaLCAAABdVNQ\n7ygAAAQDAEgwRgIhALmUv4K/i3UcPYCIseckN2n
fpk8g+Gi4MZRq6Ybr8/JXAiEA\n00kRkd+19OB2j4VASwsoQatWKasN+yTMnkQWOf2YMbsAdQB9
PvL4j/+IVWgkwsDK\nnlKJeSvFDngJfy5ql2iZfiLw1wAAAXVTUO9TAAAEAwBGMEQCICOymh52O
gxx/wjJ\ngo5TEIgfEDtgXvKdfBsVtibLeZQWAiAyiUPq2MBPxn9+KJFhhxE8LRI9VIhpWnHV\n
5JlOp2dIYzANBgkqhkiG9w0BAQsFAAOCAQEARqt9QyY4Fq7SBindKcHyrsQ9JtqB\nvfZy5yDKz
FwuQZKmk2pxOzapCNRLNeyiEalfIFzrtHI11gr1ZEFHL1rA7pO3ud/j\nM2r0lmvNf8W+kUVf4G
ng0TqGyRRh28RDNDCaz8uaYeg5C6BPUIZtHbO6qJBNme2W\noS4Qp0fjjAUvSQwTKDEh5GKnZv4
AnJifMRqSXgZ+HgsamqydODRRTszwCMTMGBhO\naUOf+wF9l90T9N3MLDxSdixh4/qMuE0LpIsy
eLJJ08ZsmOvOPtar0zxUw8AXMtGG\n62wmZhlY+vXD4Nk6cKTepSCVEHmCLTtckbHfn518wCQEv
JZYYVApG0y1QQ==\n-----END CERTIFICATE-----\n";
json.sans = [];
json.sans[0] = "api.mozz.us";
json.sans[1] = "astrobotany.mozz.us";
json.sans[2] = "chat.mozz.us";
json.sans[3] = "dev.mozz.us";
json.sans[4] = "gemini.mozz.us";
json.sans[5] = "git.mozz.us";
json.sans[6] = "goodvibes.mozz.us";
json.sans[7] = "gopher.mozz.us";
json.sans[8] = "mail-archive.mozz.us";
json.sans[9] = "mail.mozz.us";
json.sans[10] = "michael.mozz.us";
json.sans[11] = "mozz.us";
json.sans[12] = "portal.mozz.us";
json.sans[13] = "proxy.mozz.us";
json.sans[14] = "www.mozz.us";
json.serial_number = "349379594475839169414317025618006180741404";
json.sigalg = "SHA256WithRSA";
json.subject = {};
json.subject.common_name = "mozz.us";
json.subject.names = [];
json.subject.names[0] = "mozz.us";
json.subject_key_id = "8D:F1:FD:55:9F:1C:70:86:D4:87:C4:DF:69:06:1D:93:C6:E1:10:BA";


# retrieve content type
# openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | head -1
20 text/gemini; lang=en


# double check content type
# openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | file --brief --mime-type --mime-encoding -
text/plain; charset=utf-8


# validate encoding
# openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | iconv -f utf-8 -t utf-8 > /dev/null; echo $?
0


# guess language
# echo $(openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null ) | polyglot detect | cut -d' ' -f1 | uniq
English

6. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 18, 2020 at 12:12:47PM +0000,
 Luke Emmet <luke at marmaladefoo.com> wrote 
 a message of 23 lines which said:

> > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

> Could it be possible to show the distribution of page sizes in geminispace?

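Like this (the page was updated)?

* Less than 1 kbyte: 18465 URLs (48.7 %)
* 1 to 1000 kbytes: 15865 URLs (41.9 %)
* More than 1000 kbytes: 3559 URLs (9.4 %)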

> Is there any raw data available?

The code is available. For the data, I'm not decided yet. True, it is
only public data, and there is not even the content of the pages, but
I don't know yet if there isn't some privacy/ethical problem. Let me
check.

7. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 18, 2020, at 15:37, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> # geolocation
> # curl --silent https://tools.keycdn.com/geo.json?host=174.138.124.169 | jq | gron

# while at it
# whois mozz.us | grep @
e-mail:       technical1 at registry.neustar
e-mail:       registrytechnical2 at neustar.biz
Registrar Abuse Contact Email: registrar-abuse at google.com
Registrant Email: lazar.michael22 at gmail.com
Admin Email: lazar.michael22 at gmail.com
Tech Email: lazar.michael22 at gmail.com

8. Luke Emmet (luke (a) marmaladefoo.com)


>> Could it be possible to show the distribution of page sizes in geminispace?
> Like this (the page was updated)?
>
> * Less than 1 kbyte: 18465 URLs (48.7 %)
> * 1 to 1000 kbytes: 15865 URLs (41.9 %)
> * More than 1000 kbytes: 3559 URLs (9.4 %)

Those bands are very wide.

How about in increments of 10^n? e.g. 1kb, 10kb, 100kb....
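
A minimal sketch of such power-of-ten banding, assuming sizes are
available as plain byte counts. The function names and the sample sizes
are invented for illustration.

```python
from collections import Counter


def decade_bucket(size_bytes):
    """Return n such that 10**n <= size < 10**(n + 1); empty resources map to -1."""
    if size_bytes <= 0:
        return -1
    # Counting digits avoids floating-point trouble near exact powers of ten.
    return len(str(size_bytes)) - 1


def size_histogram(sizes):
    """Count how many resources fall in each power-of-ten band."""
    return Counter(decade_bucket(s) for s in sizes)


sizes = [0, 512, 999, 1000, 20_000, 3_500_000]  # made-up sizes in bytes
for exponent, count in sorted(size_histogram(sizes).items()):
    if exponent < 0:
        label = "empty"
    else:
        label = f"10^{exponent} to 10^{exponent + 1} bytes"
    print(f"{label}: {count}")
```

Each band is ten times wider than the previous one, so a heavy-tailed
distribution shows up as counts that fall off slowly across bands
rather than as one huge "1 to 1000 kbytes" bucket.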

Also we can have a good general idea of other media types, but to filter 
on text/gemini would be ideal. If you are inclined!

> The code is available. For the data, I'm not decided yet. True, it is
> only public data, and there is not even the content of the pages, but
> I don't know yet if there isn't some privacy/ethical problem. Let me
> check.

How about if the data was anonymised, like to remove IP address, domain 
name, path and file name and replaced by anonymous labels, like this

  - Domain name: "Domain1" ... "DomainN"
  - path: "Path1" ..."PathN"

but then still to include other details like:

resource size, media type, encoding, etc

That would still be very useful for statistical analysis in the 
aggregate, without revealing any identifiable info?
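
One way the labelling could look, as a sketch only. The record layout
(url, size, media_type) and the class name are assumptions made for
this example, not the format of any existing dataset.

```python
from urllib.parse import urlsplit


class Anonymiser:
    """Map each distinct domain and path to a stable opaque label."""

    def __init__(self):
        self._domains = {}
        self._paths = {}

    @staticmethod
    def _label(table, key, prefix):
        # Labels are handed out in order of first appearance.
        if key not in table:
            table[key] = f"{prefix}{len(table) + 1}"
        return table[key]

    def anonymise(self, record):
        parts = urlsplit(record["url"])
        return {
            "domain": self._label(self._domains, parts.hostname, "Domain"),
            "path": self._label(self._paths, parts.path, "Path"),
            "size": record["size"],
            "media_type": record["media_type"],
        }


# Hypothetical crawl records.
records = [
    {"url": "gemini://example.org/index.gmi", "size": 1200, "media_type": "text/gemini"},
    {"url": "gemini://example.org/cat.png", "size": 48000, "media_type": "image/png"},
    {"url": "gemini://example.net/index.gmi", "size": 300, "media_type": "text/gemini"},
]
anon = Anonymiser()
out = [anon.anonymise(r) for r in records]
for row in out:
    print(row)
```

Note that in this sketch the same path on two hosts gets the same
label, and that an unusual size or media type may still be enough to
recognise a resource, so relabelling alone may not settle the privacy
question.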

  - Luke

9. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Wed, Dec 16, 2020 at 06:16:53PM -0500,
>  Sean Conner <sean at conman.org> wrote 
>  a message of 27 lines which said:
> 
> > > You can find the current results (the crawler did not crawl the entire
> > > space yet):
> > > 
> > > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi
> 
> >   One stat I haven't seen yet (yours or from GUS) is a breakdown of
> > language.  How many pages had a lang parameter, then a breakdown by
> > language, how many multiple languages per parameters (for example,
> > "lang=en,fr").
> 
> Just ask :-) Now done:

  Thanks.

> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi
> 
> I note:
> 
> * French is the second language after english. Cocorico, as we say in
> France.
> 
> * There is one page in finnish.

  And I see one page that is both English and Japanese.

> * There are more HTML than Markdown pages on the geminispace, which I
> find surprising.

  Not really, as I've come across one Gemini site that only serves up HTML.

> * There is one page in EBCDIC and one in CP-437 :-)

  Now *THAT* is surprising.

  -spc

10. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 18, 2020 at 05:45:33PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 40 lines which said:

> > * There are more HTML than Markdown pages on the geminispace, which I
> > find surprising.
> 
>   Not really, as I've come across one Gemini site that only serves
>   up HTML.

Yes, but Gemini was supposed to be about lightness and soberness. So,
Markdown seems a better fit.

> > * There is one page in EBCDIC and one in CP-437 :-)
> 
>   Now *THAT* is surprising.

It is actually a capsule dedicated to tests <gemini://egsam.pitr.ca/>.

11. Solderpunk (solderpunk (a) posteo.net)

On Wed Dec 16, 2020 at 4:05 PM CET, Stephane Bortzmeyer wrote:
> I'm running a Gemini crawler, which gathers metadata about the
> geminispace. The goal is not to make a search engine but to survey the
> geminispace.

This is very cool, thanks for your work!

I would be curious to see more thorough statistics on TLS certificates
in Geminispace.  It's nice that you have the Let's Encrypt percentage on
there, but I'd also like to know about self-signed certificates, what
the distribution of sizes, key types, lifespans etc. is.  But I know this
information is *not* easy to get at in Python without external
dependencies, so don't feel obligated to bang your head against it.

I have plans to write a certificate observatory daemon in 2021, with a
simple Gemini interface so that TOFU clients can query it regarding new
certs.  It should be straightforward to generate this kind of
information as a side-effect, so my curiosity on this front will be
satisfied one way or another sooner or later.

Cheers,
Solderpunk

12. Stephane Bortzmeyer (stephane (a) sources.org)

On Sat, Dec 19, 2020 at 07:55:05PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 22 lines which said:

> I'd also like to know about self-signed certificates,

I'm not expert enough on X.509 but do note that the obvious algorithm
to detect self-signed certificates (checking that issuer == subject)
does not work well in the geminispace where many certs are signed
by... someone (not a known CA but not the subject).
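
To illustrate the point, here is a sketch of the naive name comparison
with a third bucket for "signed by someone else". The dictionaries
mimic the shape returned by Python's ssl.getpeercert(), which is only
populated for validated certificates, so a crawler that accepts any
certificate would have to obtain these fields another way. The host
names and the set of known CA organisations are invented.

```python
def name_to_dict(name):
    """Flatten the nested tuples used by ssl.getpeercert() into a plain dict."""
    return {key: value for rdn in name for key, value in rdn}


def classify_issuer(cert, known_ca_orgs):
    """Return 'issuer == subject', 'known CA' or 'other issuer'."""
    subject = name_to_dict(cert["subject"])
    issuer = name_to_dict(cert["issuer"])
    if issuer == subject:
        return "issuer == subject"
    if issuer.get("organizationName") in known_ca_orgs:
        return "known CA"
    return "other issuer"


# Hypothetical certificates, shaped like ssl.getpeercert() output.
certs = {
    "a.example": {
        "subject": ((("commonName", "a.example"),),),
        "issuer": ((("commonName", "a.example"),),),
    },
    "b.example": {
        "subject": ((("commonName", "b.example"),),),
        "issuer": (
            (("organizationName", "Let's Encrypt"),),
            (("commonName", "Let's Encrypt Authority X3"),),
        ),
    },
    "c.example": {
        "subject": ((("commonName", "c.example"),),),
        "issuer": ((("commonName", "my home-made CA"),),),
    },
}
known = {"Let's Encrypt"}  # illustrative, not a real trust store
for host, cert in certs.items():
    print(host, classify_issuer(cert, known))
```

The third case is the one described above. Matching names are also
only a hint: confirming that a certificate is really self-signed
requires verifying its signature against its own public key.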

> But I know this information is *not* easy to get at in Python
> without external dependencies,

Most of it is easy to get
<https://framagit.org/bortzmeyer/lupa/-/issues/7>

> I have plans to write a certificate observatory daemon in 2021, with
> a simple Gemini interface so that TOFU clients can query it
> regarding new certs.

If it is just for surveying, fine. If it is to turn it into a security
system, be careful, there are many traps: who can add new
certificates, how to be sure that clients will check it, etc.

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6962.txt

13. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 18, 2020 at 10:03:00PM +0000,
 Luke Emmet <luke at marmaladefoo.com> wrote 
 a message of 34 lines which said:

> How about if the data was anonymised, like to remove IP address, domain
> name, path and file name and replaced by anonymous labels, like this

Note that you did not receive my private message since your email server 
denies access to mine.

<luke at marmaladefoo.com>: host mx1.mythic-beasts.com[2a00:1098:0:86:1000:0:2:1]
    said: 550 Block listed: https://www.spamhaus.org/sbl/query/SBLCSS (in reply
    to MAIL FROM command)

(And, no, I do not intend to start begging Spamhaus to unlist me.)

14. Luke Emmet (luke (a) marmaladefoo.com)


On 19-Dec-2020 19:20, Stephane Bortzmeyer wrote:
> Note that you did not receive my private message since your email 
> server denies access to mine.
> <luke at marmaladefoo.com>: host mx1.mythic-beasts.com[2a00:1098:0:86:1000:0:2:1]
>      said: 550 Block listed: https://www.spamhaus.org/sbl/query/SBLCSS (in reply
>      to MAIL FROM command)
>
> (And, no, I do not intend to start begging Spamhaus to unlist me.)
Sorry about that - my domain ISP runs a fairly typical setup - I don't 
usually have problems getting email from people. I certainly don't have 
the skills or inclination to run my own mail server, so I'm not in a 
position to adjust the email server.

Feel free to reply to the list, or if you want to send a personal email, 
you might try a different email address: luke [dot] emmet [at] gmail 
[dot] com

  - Luke

15. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 16, 2020, at 16:05, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> I'm running a Gemini crawler

This is a very brave endeavor. Survivability is key in an experimental, 
buggy, hostile, or malicious environment. 

The same rules of cautiousness apply to any user-agent, but even more so 
to headless, automated bots.

There are many possible traps out there. Nothing new under the sun. The 
interweb has been through this for the last few decades: pranksters vs bots. 

As this is a known, solved problem, I will not bore you with specific 
details, assuming instead everyone knows what they are doing, eyes wide open.

Just be cautious. Assume hostility, aim for survivability.

16. Peter Vernigorov (pitr.vern (a) gmail.com)

I don't think details are boring here. Would you mind listing some of the
problems and possible solutions/workarounds?

On Tue, Dec 22, 2020 at 12:57 Petite Abeille <petite.abeille at gmail.com>
wrote:

>
>
> > On Dec 16, 2020, at 16:05, Stephane Bortzmeyer <stephane at sources.org>
> wrote:
> >
> > I'm running a Gemini crawler
>
> This is a very brave endeavor. Survivability is key in an experimental,
> buggy, hostile, or malicious environment.
>
> The same rules of cautiousness apply to any user-agent, but even more so
> to headless, automated bots.
>
> There are many possible traps out there. Nothing new under the sun. The
> interweb has been through this for the last few decades: pranksters vs
> bots.
>
> As this is a known, solved problem, I will not bore you with specific
> details, assuming instead everyone knows what they are doing, eyes wide
> open.
>
> Just be cautious. Assume hostility, aim for survivability.

17. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 22, 2020, at 13:53, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
> I don't think details are boring here.

Really? Didn't get that vibe. Anyway.

> Would you mind listing some of the problems and possible solutions/workarounds?

Two simple examples: resource exhaustion and content poisoning. 

Exhaustion is the most trivial one. The adversary (a technical term, not a 
value judgment) tries to slow you down or fill you up or reach various 
limits on your side. E.g. throttled connections, infinite output, hung 
connections, any combination of the above, etc.

This is easy to deal with: always limit everything you do, be it reading, 
writing, waiting, computing, whatnot. Eventually something will reach 
these limits and let you out of the trap. You can then mark the site as 
hostile and/or dysfunctional. Not always clear which one is which: 
incompetence or malice. 

For example, assuming the network stack goes through, a client has to read 
at most the first 1024 + some bytes of a server response to figure out 
what to do. Nothing more. Don't expect a well-formed response line. Assert 
it. Always validate. Continuously. Drop the connection as soon as 
something is not right. Always remember what happened.
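
As a sketch of "assert it, don't expect it": a header reader that never
consumes more than the longest header the specification allows (two
status digits, a space, up to 1024 bytes of META, CRLF) and rejects
anything malformed. The names are made up for this example, and a real
crawler would also set a socket timeout, which is left out here.

```python
import io

MAX_HEADER = 2 + 1 + 1024 + 2  # status, space, META (max 1024 bytes), CRLF


class BadResponse(Exception):
    """Raised as soon as the response header is not acceptable."""


def read_header(stream):
    """Read and validate one Gemini response header from a binary stream."""
    buf = bytearray()
    while not buf.endswith(b"\r\n"):
        if len(buf) >= MAX_HEADER:
            raise BadResponse("header too long")
        byte = stream.read(1)
        if not byte:
            raise BadResponse("connection closed before end of header")
        buf += byte
    line = bytes(buf[:-2])
    if len(line) < 2 or not line[:2].isdigit():
        raise BadResponse("status is not two digits")
    if len(line) > 2 and line[2:3] != b" ":
        raise BadResponse("missing space after status")
    try:
        meta = line[3:].decode("utf-8")
    except UnicodeDecodeError as exc:
        raise BadResponse("META is not valid UTF-8") from exc
    return int(line[:2]), meta


# An in-memory stream stands in for the TLS connection.
print(read_header(io.BytesIO(b"20 text/gemini; lang=en\r\n# Hello")))
```

With a real connection the stream could be the file object returned by
the socket's makefile("rb"). Whatever raised BadResponse can then be
recorded against the host, as suggested above.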

Of course, there are downsides to resource exhaustion for the adversary, 
as it's a sort of self-inflicted denial of service. Oh well.

Content poisoning is more fun. It can be anything from feeding you 
continuous junk (exhaustion + poisoning) to well-formed but ill-intentioned 
logic bombs, busy beavers, wild goose chases; the list goes on.

For example, a trivial chase is infinite redirects. Got to stop 
eventually. Another limit.
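
A sketch of such a limit. The fetch function is supplied by the caller
and is assumed to return a (status, meta) pair; the cap of 5 hops is an
arbitrary value chosen for the example, and redirect targets are
assumed to be absolute URLs.

```python
MAX_REDIRECTS = 5  # illustrative cap


class TooManyRedirects(Exception):
    """Raised when a redirect chain is too long or loops back on itself."""


def follow(url, fetch, max_redirects=MAX_REDIRECTS):
    """Follow 3x redirects, giving up after max_redirects hops or on a loop."""
    seen = {url}
    for _ in range(max_redirects + 1):
        status, meta = fetch(url)
        if not 30 <= status <= 39:
            return url, status, meta
        if meta in seen:
            raise TooManyRedirects(f"redirect loop at {meta}")
        seen.add(meta)
        url = meta  # relative targets would need resolving first
    raise TooManyRedirects(f"more than {max_redirects} redirects")


# A fake capsule that redirects forever to a new URL each time.
def endless(url):
    return 31, url + "x"


try:
    follow("gemini://trap.example/", endless)
except TooManyRedirects as exc:
    print("gave up:", exc)
```

Because every hop of the endless chain is a fresh URL, loop detection
alone would never trigger; only the hop counter gets the crawler out.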

Another one could be well-formed text/gemini, but with junky links. Same as above.

Again this is easy to identify statistically, marking the adversary as dysfunctional. 

You can then move on, or retaliate, depending on the mood. 

User-agents could also federate such information and use them in 
meaningful, if ominous, ways.

This is not a one-way street: user-agents, especially bots, can do a lot of 
damage at scale.

Always keep in mind Hanlon's razor: "never attribute to malice that which 
is adequately explained by stupidity".

https://en.wikipedia.org/wiki/Hanlon%27s_razor

Just my 2¢. Have fun.

18. ew.gemini (ew.gemini (a) nassur.net)

Hello,

Peter Vernigorov writes:

> I don't think details are boring here. Would you mind listing some of the
> problems and possible solutions/workarounds?

This

gemini://alexschroeder.ch/page/2020-12-22_Website_down%2C_disk_full%2C_logs_crazy
https://alexschroeder.ch/wiki/2020-12-22_Website_down%2c_disk_full%2c_logs_crazy

might have been caused by a crawler.

Cheers,
Erich


-- 
Keep it simple!

19. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 22, 2020, at 16:43, ew.gemini <ew.gemini at nassur.net> wrote:
> 
> might have been caused by a crawler.

There you go. Dysfunction cuts both ways. Always be defensive. It's a 
hostile environment out there.

20. Philip Linde (linde.philip (a) gmail.com)

On Wed, 16 Dec 2020 16:05:50 +0100
Stephane Bortzmeyer <stephane at sources.org> wrote:

> I'm running a Gemini crawler, which gathers metadata about the
> geminispace. The goal is not to make a search engine but to survey the
> geminispace.

That's interesting, Stephane. Could you add statistics about character
encodings used for text/gemini responses specifically? I'd like to know
if there are currently text/gemini responses in any other encoding than
UTF-8 (or US ASCII). That would be an interesting topic in the IRI+IDN
discussion.
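
A sketch of one way to answer this from the bytes themselves, rather
than from the charset parameter a server declares (which of the two a
survey should report is a separate choice). The function name is
invented for the example.

```python
def classify_encoding(body):
    """Classify raw bytes as 'us-ascii', 'utf-8' or 'other'."""
    try:
        body.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


samples = {
    "plain": b"# Hello",
    "accented": "# Cocorico, g\u00e9minispace".encode("utf-8"),
    "legacy": "# g\u00e9minispace".encode("latin-1"),
}
for name, body in samples.items():
    print(name, classify_encoding(body))
```

Every US-ASCII document is also valid UTF-8, so a split between the two
says more about which characters authors use than about any
incompatibility between encodings.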

-- 
Philip

21. Sean Conner (sean (a) conman.org)

It was thus said that the Great Philip Linde once stated:
> On Wed, 16 Dec 2020 16:05:50 +0100
> Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> > I'm running a Gemini crawler, which gathers metadata about the
> > geminispace. The goal is not to make a search engine but to survey the
> > geminispace.
> 
> That's interesting, Stephane. Could you add statistics about character
> encodings used for text/gemini responses specifically? I'd like to know
> if there are currently text/gemini responses in any other encoding than
> UTF-8 (or US ASCII). That would be an interesting topic in the IRI+IDN
> discussion.

  There's a chart on the GUS stats page:

	https://portal.mozz.us/gemini/gus.guru/statistics

  It seems it's a 54/46 split between UTF-8/US-ASCII (and 7 (seven) pages
out of 84,400 that are NOT UTF-8 nor US-ASCII).

  -spc

22. Stephane Bortzmeyer (stephane (a) sources.org)

On Thu, Dec 24, 2020 at 02:08:57AM +0100,
 Philip Linde <linde.philip at gmail.com> wrote 
 a message of 37 lines which said:

> Could you add statistics about character encodings used for
> text/gemini responses specifically?

Only for text/gemini:



But wait, all the exotic charsets are at <gemini://egsam.pitr.ca/>
which is a test site for various funny stuff. So, it is safe to say
that not one "real" gemtext resource uses anything other than UTF-8.

By the way, this is the RFC 5198 recommendation
<gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5198.txt>

> I'd like to know if there are currently text/gemini responses in any
> other encoding than UTF-8 (or US ASCII). That would be an
> interesting topic in the IRI+IDN discussion.

I don't see the relationship. There is clearly unanimity among
geminauts that *content* should be in UTF-8 (and I would suggest that
this SHOULD could be changed to a MUST); the discussion is about
metadata, the identifier (the IRI).

23. Stephane Bortzmeyer (stephane (a) sources.org)

On Wed, Dec 23, 2020 at 09:01:13PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 22 lines which said:

>   There's a chart on the GUS stats page:
> 
> 	https://portal.mozz.us/gemini/gus.guru/statistics
> 
>   It seems it's a 54/46 split between UTF-8/US-ASCII

It seems this percentage includes plain text (anyway, the sum of the
numbers matches neither the total nor the gemtext-only count). I cannot
find a gemtext page tagged as US-ASCII.

24. Stephane Bortzmeyer (stephane (a) sources.org)

On Sat, Dec 19, 2020 at 07:55:05PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 22 lines which said:

> I would be curious to see more thorough statistics on TLS certificates
> in Geminispace.  It's nice that you have the Let's Encrypt percentage on
> there, but I'd also like to know about self-signed certificates, what
> the distribution of sizes, key types,

Now done at <gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi>.

25. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 18, 2020 at 12:12:47PM +0000,
 Luke Emmet <luke at marmaladefoo.com> wrote 
 a message of 23 lines which said:

> > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

> Could it be possible to show the distribution of page sizes in geminispace?
> I know you show the average page size, but to get a better view of what is
> typical and the range would be good. For example does it follow a power law
> etc...

Now displays fixed-size ranges *and* quantiles.
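
For readers who want to reproduce the quantile part on their own data,
a small sketch using Python's statistics module. The sizes are made up
and have nothing to do with the real crawl results.

```python
import statistics

# Made-up resource sizes in bytes, with two large outliers.
sizes = [120, 340, 560, 800, 950, 1_200, 4_000, 15_000, 250_000, 3_000_000]

deciles = statistics.quantiles(sizes, n=10)  # nine cut points
median = statistics.median(sizes)
mean = statistics.mean(sizes)

print("median:", median)
print("mean:", mean)
print("deciles:", [round(d) for d in deciles])
```

With a few very large files in the sample the mean ends up far above
the median, which is why quantiles give a better idea of the typical
page than the average alone.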

26. Luke Emmet (luke (a) marmaladefoo.com)



On 20-Feb-2021 14:53, Stephane Bortzmeyer wrote:
> On Fri, Dec 18, 2020 at 12:12:47PM +0000,
>   Luke Emmet <luke at marmaladefoo.com> wrote
>   a message of 23 lines which said:
>>> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi
> Now displays fixed-size ranges *and* quantiles.

Thank you Stephane - it is interesting to see the shape of the 
geminiverse resources. It also helps to tune some typical client default 
parameters for max resource size before abandoning a client connection - 
as we know there is no Content-Length to know how much content to expect.
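
A sketch of what such a client-side cap could look like. The 5 MiB
default is an arbitrary figure for the example (the quantiles on the
stats page would be the place to pick a real one), and the function
name is invented.

```python
import io

MAX_BODY = 5 * 1024 * 1024  # arbitrary illustrative cap: 5 MiB


def read_body(stream, max_bytes=MAX_BODY, chunk_size=4096):
    """Read until EOF or until max_bytes is exceeded; return (data, truncated)."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks), False
        total += len(chunk)
        if total > max_bytes:
            # Keep only what fits under the cap, then stop reading.
            keep = len(chunk) - (total - max_bytes)
            chunks.append(chunk[:keep])
            return b"".join(chunks), True
        chunks.append(chunk)


# An in-memory stream stands in for the connection.
data, truncated = read_body(io.BytesIO(b"a" * 1000), max_bytes=100, chunk_size=64)
print(len(data), truncated)
```

Since the server signals the end of the body only by closing the
connection, the truncated flag is the client's one way of telling "this
was everything" apart from "I stopped listening".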

I know it is cheeky to keep coming with new suggestions - but it would 
be handy to know some time what is the shape of the predominant gemini 
resource - text/gemini. I assume that currently the stats apply to all 
resources, so may be skewed up due to binary files etc.

Regards

  - Luke

27. Stephane Bortzmeyer (stephane (a) sources.org)

On Sat, Feb 20, 2021 at 06:30:19PM +0000,
 Luke Emmet <luke at marmaladefoo.com> wrote 
 a message of 22 lines which said:

> I know it is cheeky to keep coming with new suggestions

Developers love requests!

> but it would be handy to know some time what is the shape of the
> predominant gemini resource - text/gemini. I assume that currently
> the stats apply to all resources, so may be skewed up due to binary
> files etc.

Just ask and it's now done.

28. Luke Emmet (luke (a) marmaladefoo.com)



On 21-Feb-2021 17:23, Stephane Bortzmeyer wrote:
>
>> but it would be handy to know some time what is the shape of the
>> predominant gemini resource - text/gemini. I assume that currently
>> the stats apply to all resources, so may be skewed up due to binary
>> files etc.
> Just ask and it's now done.
Thank you Stephane for adding more fine grained info about the 
text/gemini subset of resources. I think these statistics are helpful in 
understanding the broad shape of the geminiverse.

  - Luke
