💾 Archived View for rawtext.club › ~sloum › geminilist › 001117.gmi captured on 2020-09-24 at 02:06:36. Gemini links have been rewritten to link to archived content


The lang parameter to text/gemini

solderpunk solderpunk at SDF.ORG

Thu May 28 22:15:35 BST 2020

- - - - - - - - - - - - - - - - - - - 

On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:

I agree with this in principle (i.e., as a guidepost for good user
experience in a search engine), but there can be complicating factors
in practice. In particular, some common and generally effective text
indexing processes involve things like porter stemming words (so
"stemmed" and "stemming" would both get indexed as something like
"stem") and removal of "stop words" (and, in, the...). As you might
imagine, both of these operations are specific to language.

So, simply in creating an index of Geminispace, there might already be
an assumed "default" language. In the case of GUS, this is English. I
don't stop GUS from indexing any non-English content currently, but
the quality of indexing is lower for other languages. Operations like
the above (porter stemming and removing stop words) will simply be
no-ops.

And then the other side of this experience is that when a user types
in a search query, that also goes through the same process - the query
is porter stemmed, stripped of its stop words, then shuttled off to
the TF-IDF index to find and score the actual matches.
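
As a rough illustration of the pipeline described above, here is a minimal Python sketch. It is not GUS's actual code: `crude_stem` is a hypothetical stand-in that only strips a couple of English suffixes (it is not the real Porter algorithm), the stop-word list is deliberately tiny, and the TF-IDF weighting is one common variant chosen for brevity. For any language other than English, stemming and stop-word removal are simply skipped, which mirrors the "no-ops" behaviour mentioned above.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real indexers use longer ones.
STOP_WORDS = {"en": {"and", "in", "the", "a", "an", "of", "is"}}


def crude_stem(word):
    """Strip a few English suffixes. A stand-in for Porter stemming, not the real thing."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem".
            if word[-1] == word[-2] and word[-1] not in "aeioulsz":
                word = word[:-1]
            break
    return word


def normalize(text, lang):
    """Tokenize, then apply stop-word removal and stemming for English only."""
    tokens = re.findall(r"\w+", text.lower())
    if lang != "en":
        return tokens  # language-specific steps are no-ops for other languages
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS["en"]]


def build_index(docs):
    """docs maps doc_id -> (lang, text); the index maps doc_id -> term counts."""
    return {doc_id: Counter(normalize(text, lang))
            for doc_id, (lang, text) in docs.items()}


def search(index, query, query_lang="en"):
    """Run the query through the same normalization, then rank by TF-IDF."""
    terms = normalize(query, query_lang)
    n_docs = len(index)
    scores = {}
    for doc_id, counts in index.items():
        total = sum(counts.values())
        score = 0.0
        for term in terms:
            df = sum(1 for c in index.values() if term in c)
            if df == 0 or counts[term] == 0:
                continue
            tf = counts[term] / total
            idf = math.log(1 + n_docs / df)
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": ("en", "Stemming the words in an index"),
    "b": ("en", "Gemini capsules and gophers"),
}
index = build_index(docs)
results = search(index, "stemmed")
```

Because the query and the documents pass through the same `normalize` step, a search for "stemmed" can match a page that only contains "stemming". A page indexed under another language keeps its raw tokens, so the same query would only match it on an exact word.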

Thanks for shedding some light on the processing that happens behind the scenes in GUS! Language declaration is even more important to search engines than I had realised. Once we've got this specced we will have to really encourage the authors of servers to make it possible for users to control this parameter, and to encourage the folks at non-English servers to use it!

What does directionality mean in this context?

Left-to-right vs right-to-left vs top-to-bottom, etc. I don't think this will actually be relevant for search at all?

In terms of how the above all sounds:
- I support the addition of the document-level lang parameter, which
can accept multiple values, to the spec.
- I do NOT support the addition of a default document-level lang value
to the spec.
- I do NOT support the addition of a line-level lang parameter.
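
To make the first two points concrete, here is a small Python sketch of how a client or crawler might read a document-level lang parameter. The thread only says the parameter "can accept multiple values"; the comma-separated syntax, the function name `parse_lang`, and the example strings below are assumptions for illustration. Returning an empty list when the parameter is absent reflects the "no default value" position.

```python
def parse_lang(meta):
    """Return the language tags from a MIME string such as 'text/gemini; lang=en,fr'.

    Assumes multiple values are comma-separated (an assumption, not settled here).
    """
    mime, *params = [part.strip() for part in meta.split(";")]
    if mime.lower() != "text/gemini":
        return []
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "lang":
            value = value.strip().strip('"')
            return [tag.strip() for tag in value.split(",") if tag.strip()]
    return []  # parameter absent: no default language is assumed


tags = parse_lang("text/gemini; lang=en,fr")
```

An indexer could use the returned tags to pick language-specific processing, and fall back to language-neutral handling when the list is empty.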

Thanks for the nice, clear summary!

Cheers,
Solderpunk