💾 Archived View for gemi.dev › gemini-mailing-list › 000157.gmi captured on 2024-05-26 at 15:23:46. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

The lang parameter to text/gemini

1. solderpunk (solderpunk (a) SDF.ORG)

Ahoy!

Let's pick this issue up again, in its own thread this time.

My original proposal was that we add a new parameter to the text/gemini
media type to specify the human language a document is written in.
Following the lead of RFC1766, the parameter would be called "lang" and
take values based on ISO 639 language codes and ISO 3166 country codes.
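A minimal sketch of how a client might pull such a parameter out of the MIME type string, assuming it is written like any other media type parameter (for example "text/gemini; lang=en-US"). The function name parse_meta is hypothetical and not part of any spec.

```python
def parse_meta(meta: str) -> tuple[str, dict[str, str]]:
    """Split a MIME type string into (media_type, parameters).

    Parameter names are lowercased; values keep their case.
    """
    media_type, *raw_params = [part.strip() for part in meta.split(";")]
    params = {}
    for raw in raw_params:
        if "=" not in raw:
            continue  # skip empty or malformed segments
        name, value = raw.split("=", 1)
        params[name.strip().lower()] = value.strip().strip('"')
    return media_type.lower(), params


media_type, params = parse_meta("text/gemini; charset=utf-8; lang=en-US")
lang = params.get("lang")  # None when no language was declared
```

A client that has no use for the language can simply ignore the extra parameter.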

As far as I recall, nobody actually objected to this as something we
should do in principle, instead we just got distracted by various edge
cases.  But I guess I may as well ask now: does anybody think this is a
*bad* idea?

The two concrete motivations for adding this were:

1. Screenreaders need to know this information to know which settings to
use for their text-to-speech engine: the same letters correspond to
different sounds in different languages.

2. Search engines may want to offer their users the ability to ask
for results only in a particular set of languages.

Can anybody think of additional likely use cases besides these?

Since these are the main motivations, that also means that "normal
clients" (i.e. for use by sighted human users) have minimal use for this
information and can more or less ignore it.  So, in considering the edge
cases that came up, we should be thinking about screenreaders and search
engines, not the stuff that most people here are presumably using day to
day.

The first question was what to do if the parameter is not specified.

I was, and am, opposed to putting a default language in the spec.

In the case of a screenreader, it seems entirely sensible to me that the
user of any such screenreader should be able to specify their own
default based on their primary reading languages, and that the software
should make it easy to change this when it is clear there is a problem.
It's not really the Gemini spec's job to say anything about this.

The case of search engines is trickier, since their resulting database
does not have just one user but many.  This was where autodetection
first came up, which some people seemed to get carried away with.  Fully
generalised autodetection of language is computationally expensive and
it gives answers with some uncertainty.  A large search engine project
*may* want to think about it - the idea of clients for human users
doing it as a routine response to a lack of a lang parameter is nuts.

A simpler option for search engines might simply be to interpret a user
request of "only show me results in languages X" as "don't show results
*known* to be in languages other than X", i.e. documents for which the
language is not known are always possible search results.  This is
imperfect, but, well, sometimes life is.
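The rule above could be sketched roughly as follows. The Page record and filter_by_language are hypothetical names, and comparing only the primary language subtag ("en" out of "en-US") is an assumption, not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    lang: Optional[str]  # None means no lang parameter was declared


def filter_by_language(pages: list[Page], wanted: set[str]) -> list[Page]:
    """Drop only pages *known* to be in a language outside `wanted`."""
    wanted = {w.lower() for w in wanted}
    kept = []
    for page in pages:
        if page.lang is None:
            kept.append(page)  # unknown language: always a possible result
        elif page.lang.split("-")[0].lower() in wanted:
            kept.append(page)  # declared language matches the request
    return kept


pages = [
    Page("gemini://example.org/a.gmi", "en-US"),
    Page("gemini://example.org/b.gmi", "fi"),
    Page("gemini://example.org/c.gmi", None),
]
results = filter_by_language(pages, {"en"})
```

Here the Finnish page is excluded, while the page with no declared language stays in the results.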

In short, I am not sure that the lack of specified default behaviour is
a good reason not to go ahead with this.

The second question was what to do when a document contains text in
multiple languages.  This is a trickier question.  I'd prefer not to
define a new line type to handle it.  We could at least allow the lang
parameter to accept multiple values separated by some delimiter.  It
wouldn't be clear from that which parts were what, but it could at least
act as a strong hint to screenreaders.  Search engines could include
such pages in results if any of the declared languages matched one the
user had requested.  Actually, perhaps that's a perfectly adequate
solution, in which case this is not trickier at all.
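A rough sketch of the multi-value idea, assuming a comma as the delimiter (the message only says "some delimiter", so the comma is a guess). Both function names are hypothetical.

```python
def parse_lang_list(value: str, delimiter: str = ",") -> list[str]:
    """Split a multi-valued lang parameter into individual language tags."""
    return [tag.strip() for tag in value.split(delimiter) if tag.strip()]


def matches_any(declared: list[str], requested: list[str]) -> bool:
    """True if any declared language matches any requested one.

    Only the primary language subtag is compared, which is an assumption.
    """
    declared_primary = {tag.split("-")[0].lower() for tag in declared}
    requested_primary = {tag.split("-")[0].lower() for tag in requested}
    return bool(declared_primary & requested_primary)


declared = parse_lang_list("en-GB, fr")
include_for_french_reader = matches_any(declared, ["fr"])
include_for_german_reader = matches_any(declared, ["de"])
```

A screenreader could treat the same list as a hint about which voices it may need, without knowing which lines are in which language.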

There's also the question of directionality, which I think might require
a separate parameter entirely.  But let's focus on the language thing
for now.  How does the above sound to people?

Cheers,
Solderpunk


2. Nicole Mazzuca (nicole (a) strega-nil.co)

I would say, generally, that the base directionality of text is given by 
the script one is using, which is defined by the language tag. A language 
has a default script (for en-US, that's Latin), and if someone wants to 
change their script, it's very easy to do so via the script part of the 
lang tag, for example, yi-US (which is shorthand for yi-Hebr-US, and is 
RTL) vs yi-Latn-US (LTR).
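As a sketch of that idea: a client could look for an explicit script subtag, fall back to a default script for the language, and read the base direction off the script. The two tables below are a tiny illustrative subset written for this example, not a complete registry, and the function names are hypothetical.

```python
# Illustrative subsets only; a real client would need fuller tables.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa"}
DEFAULT_SCRIPT = {"en": "Latn", "yi": "Hebr", "he": "Hebr", "ar": "Arab", "fa": "Arab"}


def script_of(tag: str):
    """Return the script for a language tag, or None if it is not known."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            return sub.title()  # explicit script subtag, e.g. "Latn"
    return DEFAULT_SCRIPT.get(subtags[0].lower())


def base_direction(tag: str):
    """Return "rtl", "ltr", or None when the script cannot be determined."""
    script = script_of(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"


direction_default = base_direction("yi-US")       # default script is Hebr
direction_latin = base_direction("yi-Latn-US")    # explicit Latin script
```

This only covers left-to-right versus right-to-left; it says nothing about vertical layout.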

Nicole



3. solderpunk (solderpunk (a) SDF.ORG)

On Thu, May 28, 2020 at 08:02:34PM +0000, Nicole Mazzuca wrote:
> I would say, generally, that the base directionality of text is given by
> the script one is using, which is defined by the language tag. A language
> has a default script (for en-US, that's Latin), and if someone wants to
> change their script, it's very easy to do so via the script part of the
> lang tag, for example, yi-US (which is shorthand for yi-Hebr-US, and is
> RTL) vs yi-Latn-US (LTR).

Ah, I wasn't aware of this, thank you!  Shufei, if you're reading, do
you know if this addresses the concerns you've voiced to me previously
about vertical rendering of traditional Chinese?  Or is the problem that
there's just no standardised way to denote that, rather than a lack of
client support?

Cheers,
Solderpunk


4. Natalie Pendragon (natpen (a) natpen.net)

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> As far as I recall, nobody actually objected to this as something we
> should do in principle, instead we just got distracted by various edge
> cases.  But I guess I may as well ask now: does anybody think this is a
> *bad* idea?

Nope, I think it's a nice addition and not a bad idea at all! Low
extensibility, high value for the two use cases you described (screen
readers and search engines).

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> I was, and am, opposed to putting a default language in the spec.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The case of search engines is trickier, since their resulting database
> does not have just one user but many.  This was where autodetection
> first came up, which some people seemed to get carried away with.  Fully
> generalised autodetection of language is computationally expensive and
> it gives answers with some uncertainty.  A large search engine project
> *may* want to think about it - the idea of clients for human users
> doing it as a routine response to a lack of a lang parameter is nuts.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> A simpler option for search engines might simply be to interpret a user
> request of "only show me results in languages X" as "don't show results
> *known* to be in languages other than X".  i.e. documents for which the
> language is not known are always possible search results.  This is
> imperfect, but, well, sometimes life is.
>
> In short, I am not sure that the lack of specified default behaviour is
> a good reason not to go ahead with this.

I agree with this in principle (i.e., as a guidepost for good user
experience in a search engine), but there can be complicating factors
in practice. In particular, some common and generally effective text
indexing processes involve things like Porter stemming words (so
"stemmed" and "stemming" would both get indexed as something like
"stem") and removal of "stop words" (and, in, the...). As you might
imagine, both of these operations are specific to language.

So, simply in creating an index of Geminispace, there might already be
an assumed "default" language. In the case of GUS, this is English. I
don't stop GUS from indexing any non-English content currently, but
the quality of indexing is lower for other languages. Operations like
the above (Porter stemming and removing stop words) will simply be
no-ops.

And then the other side of this experience is that when a user types
in a search query, that also goes through the same process - the query
is Porter stemmed, stripped of its stop words, then shuttled off to
the TF-IDF index to find and score the actual matches.
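To make the pipeline concrete, here is a toy sketch of it. This is not GUS's code: the suffix stripper is a crude stand-in for the real Porter algorithm, the stop word list is a tiny sample, and the scoring is a simplified TF-IDF. All names are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"and", "in", "the", "a", "of", "is"}  # small English sample
SUFFIXES = ("ing", "ed")  # crude stand-in, NOT the Porter algorithm


def normalise(text: str) -> list[str]:
    """Lowercase, drop English stop words, strip a few English suffixes."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                if word[-1] == word[-2] and word[-1] not in "lsz":
                    word = word[:-1]  # "stemm" -> "stem"
                break
        terms.append(word)
    return terms


def build_index(docs: dict[str, str]):
    """Return per-document term counts and document frequencies."""
    term_counts = {name: Counter(normalise(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def search(query: str, term_counts, doc_freq) -> list[str]:
    """Rank documents by a simplified TF-IDF score for the query."""
    n_docs = len(term_counts)
    scores = {}
    for name, counts in term_counts.items():
        total = 0.0
        for term in normalise(query):  # the query gets the same treatment
            if counts[term]:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                total += counts[term] * idf
        if total > 0:
            scores[name] = total
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "en.gmi": "Stemming and the stemmed words in the index",
    "fi.gmi": "Hakukone indeksoi sanat",
}
term_counts, doc_freq = build_index(docs)
ranking = search("stemming", term_counts, doc_freq)
finnish_terms = normalise("Hakukone indeksoi sanat")
```

The Finnish text passes through unchanged because none of the English rules apply to it, which is the lower-quality indexing for other languages described above.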

For what it's worth, I do not believe any of what I've written here is
an argument for adding a default language to the spec. That, to me,
feels solidly outside the appropriate scope of the spec. But, if we're
talking search engines, there's probably going to end up being a
default language in practice for any search engine based on mainstream
full-text search approaches.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The second question was what to do when a document contains text in
> multiple languages.  This is a trickier question.  I'd prefer not to
> define a new line type to handle it.  We could at least allow the lang
> parameter to accept multiple values separated by some delimiter.

Agreed! I think the power-to-weight ratio of a line-specific lang
value is too low. Allowing multiple lang values at the document level
feels like a nice balance though.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> There's also the question of directionality, which I think might require
> a separate parameter entirely.  But let's focus on the language thing
> for now.  How does the above sound to people?

What does directionality mean in this context?

In terms of how the above all sounds:
- I support the addition of the document-level lang parameter, which
  can accept multiple values, to the spec.
- I do NOT support the addition of a default document-level lang value
  to the spec.
- I do NOT support the addition of a line-level lang parameter.

Natalie


5. solderpunk (solderpunk (a) SDF.ORG)

On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:

> I agree with this in principle (i.e., as a guidepost for good user
> experience in a search engine), but there can be complicating factors
> in practice. In particular, some common and generally effective text
> indexing processes involve things like Porter stemming words (so
> "stemmed" and "stemming" would both get indexed as something like
> "stem") and removal of "stop words" (and, in, the...). As you might
> imagine, both of these operations are specific to language.
> 
> So, simply in creating an index of Geminispace, there might already be
> an assumed "default" language. In the case of GUS, this is English. I
> don't stop GUS from indexing any non-English content currently, but
> the quality of indexing is lower for other languages. Operations like
> the above (Porter stemming and removing stop words) will simply be
> no-ops.
> 
> And then the other side of this experience is that when a user types
> in a search query, that also goes through the same process - the query
> is Porter stemmed, stripped of its stop words, then shuttled off to
> the TF-IDF index to find and score the actual matches.

Thanks for shedding some light on the processing that happens behind the
scenes in GUS!  Language declaration is even more important to search
engines than I had realised.  Once we've got this specced we will have
to really encourage the authors of servers to make it possible for users
to control this parameter, and to encourage the folks at non-English
servers to use it!
 
> What does directionality mean in this context?

Left-to-right vs right-to-left vs top-to-bottom, etc.  I don't think
this will actually be relevant for search at all?
 
> In terms of how the above all sounds:
> - I support the addition of the document-level lang parameter, which
>   can accept multiple values, to the spec.
> - I do NOT support the addition of a default document-level lang value
>   to the spec.
> - I do NOT support the addition of a line-level lang parameter.

Thanks for the nice, clear summary!

Cheers,
Solderpunk


6. Natalie Pendragon (natpen (a) natpen.net)

On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:
> What does directionality mean in this context?

Ah, I understand directionality now from reading Nicole's response!
Left-to-right vs right-to-left languages.
