💾 Archived View for rawtext.club › ~sloum › geminilist › 001113.gmi captured on 2020-09-24 at 02:06:40. Gemini links have been rewritten to link to archived content

The lang parameter to text/gemini

Natalie Pendragon natpen at natpen.net

Thu May 28 21:51:56 BST 2020

- - - - - - - - - - - - - - - - - - - 

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

As far as I recall, nobody actually objected to this as something we
should do in principle, instead we just got distracted by various edge
cases. But I guess I may as well ask now: does anybody think this is a
*bad* idea?

Nope, I think it's a nice addition and not a bad idea at all! Low extensibility, high value for the two use cases you described (screenreaders and search engines).

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

I was, and am, opposed to putting a default language in the spec.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

The case of search engines is trickier, since their resulting database
does not have just one user but many. This was where autodetection
first came up, which some people seemed to get carried away with. Fully
generalised autodetection of language is computationally expensive and
it gives answers with some uncertainty. A large search engine project
*may* want to think about it - the idea of clients for human users
doing it as a routine response to a lack of a lang parameter is nuts.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

A simpler option for search engines might simply be to interpret a user
request of "only show me results in languages X" as "don't show results
*known* to be in languages other than X". i.e. documents for which the
language is not known are always possible search results. This is
imperfect, but, well, sometimes life is.
In short, I am not sure that the lack of specified default behaviour is
a good reason not to go ahead with this.

I agree with this in principle (i.e., as a guidepost for good user experience in a search engine), but there can be complicating factors in practice. In particular, some common and generally effective text indexing processes involve things like Porter stemming words (so "stemmed" and "stemming" would both get indexed as something like "stem") and removal of "stop words" (and, in, the...). As you might imagine, both of these operations are specific to language.

So, simply in creating an index of Geminispace, there might already be an assumed "default" language. In the case of GUS, this is English. I don't stop GUS from indexing any non-English content currently, but the quality of indexing is lower for other languages. Operations like the above (Porter stemming and removing stop words) will simply be no-ops.
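
To make the two operations concrete, here is a toy sketch. It is not GUS's actual code: `crude_stem` is a deliberately crude suffix stripper standing in for the real Porter algorithm, and `STOP_WORDS` is a tiny hypothetical sample list. It only illustrates why English-specific normalization collapses English word forms while leaving text in another language essentially untouched.

```python
import re

# Hypothetical, tiny English stop-word list (real lists are much longer).
STOP_WORDS = {"and", "in", "the", "a", "of", "to", "is"}


def crude_stem(word):
    """Crude English suffix stripping; a stand-in for Porter stemming."""
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo a doubled final consonant: "stemm" -> "stem".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # Strip a simple plural "s", but leave short words and "-ss" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"\w+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("Stemmed and stemming in the index"))
# → ['stem', 'stem', 'index']
print(normalize("der Hund und die Katze"))
# → ['der', 'hund', 'und', 'die', 'katze']
```

In the German example nothing is removed or stemmed, so words like "der", "und" and "die" end up in the index as if they carried meaning.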

And then the other side of this experience is that when a user types in a search query, that also goes through the same process - the query is Porter stemmed, stripped of its stop words, then shuttled off to the TF-IDF index to find and score the actual matches.

For what it's worth, I do not believe any of what I've written here is an argument for adding a default language to the spec. That, to me, feels solidly outside the appropriate scope of the spec. But, if we're talking search engines, there's probably going to end up being a default language in practice for any search engine based on mainstream full-text search approaches.
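
For reference, the filtering rule quoted above ("don't show results *known* to be in languages other than X") could look something like the sketch below. The function name and the `(url, langs)` result shape are hypothetical, and treating a multi-language document as a match when any one of its languages is wanted is an assumption, not something settled in this thread.

```python
def filter_by_language(results, wanted):
    """Drop only results *known* to be in none of the wanted languages."""
    wanted = {lang.lower() for lang in wanted}
    kept = []
    for url, langs in results:
        # langs is None or empty when no lang parameter was supplied,
        # so the document's language is unknown and it stays eligible.
        if not langs or wanted & {lang.lower() for lang in langs}:
            kept.append(url)
    return kept


results = [
    ("gemini://example.org/en.gmi", ["en"]),
    ("gemini://example.org/fr.gmi", ["fr"]),
    ("gemini://example.org/unknown.gmi", None),
    ("gemini://example.org/mixed.gmi", ["fr", "en"]),
]
print(filter_by_language(results, ["en"]))
# → ['gemini://example.org/en.gmi', 'gemini://example.org/unknown.gmi', 'gemini://example.org/mixed.gmi']
```

Note that no default language is needed anywhere: "unknown" is simply its own state.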

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

The second question was what to do when a document contains text in
multiple languages. This is a trickier question. I'd prefer not to
define a new line type to handle it. We could at least allow the lang
parameter to accept multiple values separated by some delimiter.

Agreed! I think the power-to-weight ratio of a line-specific lang value is too low. Allowing multiple lang values at the document level feels like a nice balance though.
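
As a sketch of what a multi-valued, document-level lang parameter might look like to a client or crawler: the delimiter has not been decided in this thread, so the comma used below is purely an assumption, as is the `parse_lang` helper name. Consistent with the no-default position, a missing parameter yields an empty list rather than a guessed language.

```python
def parse_lang(mime):
    """Return the lang values from a MIME type string, or [] if none is given."""
    parts = [p.strip() for p in mime.split(";")]
    for param in parts[1:]:  # parts[0] is the media type itself
        name, _, value = param.partition("=")
        if name.strip().lower() == "lang":
            value = value.strip().strip('"')
            # Assumed delimiter: a comma between language tags.
            return [v.strip() for v in value.split(",") if v.strip()]
    return []


print(parse_lang("text/gemini; lang=en,fr"))
# → ['en', 'fr']
print(parse_lang("text/gemini; charset=utf-8"))
# → []
```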

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:

There's also the question of directionality, which I think might require
a separate parameter entirely. But let's focus on the language thing
for now. How does the above sound to people?

What does directionality mean in this context?

In terms of how the above all sounds:

- I support the addition of the document-level lang parameter, which can accept multiple values, to the spec.
- I do NOT support the addition of a default document-level lang value to the spec.
- I do NOT support the addition of a line-level lang parameter.

Natalie