Ahoy!

Let's pick this issue up again, in its own thread this time.

My original proposal was that we add a new parameter to the text/gemini media type to specify the human language a document is written in. Following the lead of RFC1766, the parameter would be called "lang" and take values based on ISO 639 language codes and ISO 3166 country codes.

As far as I recall, nobody actually objected to this as something we should do in principle, instead we just got distracted by various edge cases. But I guess I may as well ask now: does anybody think this is a
I would say, generally, that the base directionality of text is given by the script one is using, which is defined by the language tag. A language has a default script (for en-US, that's Latin), and if someone wants to change their script, it's very easy to do so via the script part of the lang tag: for example, yi-US (which is shorthand for yi-Hebr-US, and is RTL) vs yi-Latn-US (LTR).

Nicole

On Thu, May 28, 2020 at 11:43, solderpunk <solderpunk at SDF.ORG> wrote:
> Ahoy!
>
> Let's pick this issue up again, in its own thread this time.
>
> My original proposal was that we add a new parameter to the text/gemini
> media type to specify the human language a document is written in.
> Following the lead of RFC1766, the parameter would be called "lang" and
> take values based on ISO 639 language codes and ISO 3166 country codes.
>
> As far as I recall, nobody actually objected to this as something we
> should do in principle, instead we just got distracted by various edge
> cases. But I guess I may as well ask now: does anybody think this is a
> *bad* idea?
>
> The two concrete motivations for adding this were:
>
> 1. Screenreaders need to know this information to know which settings to
> use for their text-to-speech engine: the same letters correspond to
> different sounds in different languages.
>
> 2. Search engines may want to offer their users the ability to ask
> for results only in a particular set of languages.
>
> Can anybody think of additional likely use cases besides these?
>
> Since these are the main motivations, that also means that "normal
> clients" (i.e. for use by sighted human users) have minimal use for this
> information and can more or less ignore it. So, in considering the edge
> cases that came up, we should be thinking about screenreaders and search
> engines, not the stuff that most people here are presumably using day to
> day.
>
> The first question was what to do if the parameter is not specified.
>
> I was, and am, opposed to putting a default language in the spec.
>
> In the case of a screenreader, it seems entirely sensible to me that the
> user of any such screenreader should be able to specify their own
> default based on their primary reading languages, and that the software
> should make it easy to change this when it is clear there is a problem.
> It's not really the Gemini spec's job to say anything about this.
>
> The case of search engines is trickier, since their resulting database
> does not have just one user but many. This was where autodetection
> first came up, which some people seemed to get carried away with. Fully
> generalised autodetection of language is computationally expensive and
> it gives answers with some uncertainty. A large search engine project
> *may* want to think about it - the idea of clients for human users
> doing it as a routine response to a lack of a lang parameter is nuts.
>
> A simpler option for search engines might simply be to interpret a user
> request of "only show me results in languages X" as "don't show results
> *known* to be in languages other than X", i.e., documents for which the
> language is not known are always possible search results. This is
> imperfect, but, well, sometimes life is.
>
> In short, I am not sure that the lack of specified default behaviour is
> a good reason not to go ahead with this.
>
> The second question was what to do when a document contains text in
> multiple languages. This is a trickier question. I'd prefer not to
> define a new line type to handle it. We could at least allow the lang
> parameter to accept multiple values separated by some delimiter. It
> wouldn't be clear from that which parts were what, but it could at least
> act as a strong hint to screenreaders. Search engines could include
> such pages in results if any of the declared languages matched one the
> user had requested. Actually, perhaps that's a perfectly adequate
> solution, in which case this is not trickier at all.
>
> There's also the question of directionality, which I think might require
> a separate parameter entirely. But let's focus on the language thing
> for now. How does the above sound to people?
>
> Cheers,
> Solderpunk

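
To make the proposal in the quoted message concrete, here is a minimal sketch of how a client might read a multi-valued "lang" parameter from a media type string. The thread only says "some delimiter", so the comma used here is an assumption, and the function name parse_lang_param is hypothetical. In line with the proposal, a missing parameter yields an empty list rather than a default language.

```python
def parse_lang_param(meta, delimiter=","):
    """Return the language tags declared in a media type string.

    `delimiter` is an assumption: the proposal leaves the separator open.
    An absent lang parameter gives [], i.e. "language unknown"; no default
    language is substituted.
    """
    # Everything after the first ";" is a list of name=value parameters.
    for param in meta.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "lang":
            value = value.strip().strip('"')
            return [tag.strip() for tag in value.split(delimiter) if tag.strip()]
    return []


print(parse_lang_param("text/gemini; lang=en,fr"))
print(parse_lang_param("text/gemini"))
```

A screenreader could treat the returned list as a hint about which voices to load and fall back to the user's own configured default when the list is empty.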
On Thu, May 28, 2020 at 08:02:34PM +0000, Nicole Mazzuca wrote:
> I would say, generally, that the base directionality of text is given by the script one is using, which is defined by the language tag. A language has a default script (for en-US, that's Latin), and if someone wants to change their script, it's very easy to do so via the script part of the lang tag, for example, yi-US (which is shorthand for yi-Hebr-US, and is RTL) vs yi-Latn-US (LTR).

Ah, I wasn't aware of this, thank you!

Shufei, if you're reading, do you know if this addresses the concerns you've voiced to me previously about vertical rendering of traditional Chinese? Or is the problem that there's just no standardised way to denote that, rather than a lack of client support?

Cheers,
Solderpunk

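
Nicole's point, that base direction follows from the script and the script follows from the language tag, can be sketched roughly as below. The two lookup tables are deliberately tiny and purely illustrative; a real client would presumably draw this data from the IANA language subtag registry and Unicode script metadata. The function name base_direction is hypothetical, and vertical layout is not handled at all.

```python
# Illustrative tables only, not complete data.
DEFAULT_SCRIPT = {"en": "Latn", "yi": "Hebr", "he": "Hebr", "ar": "Arab"}
RTL_SCRIPTS = {"Hebr", "Arab"}


def base_direction(lang_tag):
    """Return "rtl", "ltr", or None if the script cannot be determined."""
    subtags = lang_tag.split("-")
    language = subtags[0].lower()

    # An explicit script subtag is four letters, e.g. "Latn" in yi-Latn-US.
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            script = subtag.title()
            break

    # Otherwise fall back to the language's default script, if known.
    if script is None:
        script = DEFAULT_SCRIPT.get(language)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"


for tag in ("yi-US", "yi-Latn-US", "en-US"):
    print(tag, base_direction(tag))
```

Under these assumptions yi-US resolves to RTL through its default Hebrew script, while yi-Latn-US resolves to LTR because the explicit script subtag takes precedence.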
On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> As far as I recall, nobody actually objected to this as something we
> should do in principle, instead we just got distracted by various edge
> cases. But I guess I may as well ask now: does anybody think this is a
> *bad* idea?

Nope, I think it's a nice addition and not a bad idea at all! Low extensibility, high value for the two use cases you described (screen readers and search engines).

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> I was, and am, opposed to putting a default language in the spec.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The case of search engines is trickier, since their resulting database
> does not have just one user but many. This was where autodetection
> first came up, which some people seemed to get carried away with. Fully
> generalised autodetection of language is computationally expensive and
> it gives answers with some uncertainty. A large search engine project
> *may* want to think about it - the idea of clients for human users
> doing it as a routine response to a lack of a lang parameter is nuts.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> A simpler option for search engines might simply be to interpret a user
> request of "only show me results in languages X" as "don't show results
> *known* to be in languages other than X", i.e., documents for which the
> language is not known are always possible search results. This is
> imperfect, but, well, sometimes life is.
>
> In short, I am not sure that the lack of specified default behaviour is
> a good reason not to go ahead with this.

I agree with this in principle (i.e., as a guidepost for good user experience in a search engine), but there can be complicating factors in practice. In particular, some common and generally effective text indexing processes involve things like Porter stemming words (so "stemmed" and "stemming" would both get indexed as something like "stem") and removal of "stop words" (and, in, the...). As you might imagine, both of these operations are specific to language.

So, simply in creating an index of Geminispace, there might already be an assumed "default" language. In the case of GUS, this is English. I don't stop GUS from indexing any non-English content currently, but the quality of indexing is lower for other languages. Operations like the above (Porter stemming and removing stop words) will simply be no-ops.

And then the other side of this experience is that when a user types in a search query, that also goes through the same process - the query is Porter stemmed, stripped of its stop words, then shuttled off to the TF-IDF index to find and score the actual matches.

For what it's worth, I do not believe any of what I've written here is an argument for adding a default language to the spec. That, to me, feels solidly outside the appropriate scope of the spec. But, if we're talking search engines, there's probably going to end up being a default language in practice for any search engine based on mainstream full-text search approaches.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The second question was what to do when a document contains text in
> multiple languages. This is a trickier question. I'd prefer not to
> define a new line type to handle it. We could at least allow the lang
> parameter to accept multiple values separated by some delimiter.

Agreed! I think the power-to-weight ratio of a line-specific lang value is too low. Allowing multiple lang values at the document level feels like a nice balance though.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> There's also the question of directionality, which I think might require
> a separate parameter entirely. But let's focus on the language thing
> for now. How does the above sound to people?

What does directionality mean in this context?

In terms of how the above all sounds:

- I support the addition of the document-level lang parameter, which can accept multiple values, to the spec.
- I do NOT support the addition of a default document-level lang value to the spec.
- I do NOT support the addition of a line-level lang parameter.

Natalie

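
The filtering rule quoted above ("don't show results *known* to be in languages other than X", with multi-language documents matching if any declared language matches) could look something like the sketch below. It is not how GUS works; the function name is hypothetical, and comparing only the primary language subtag (so that "en-US" satisfies a request for "en") is an assumption the thread does not settle.

```python
def matches_language_filter(declared, requested):
    """Decide whether a document may appear in language-filtered results.

    declared:  language tags from the document's lang parameter ([] if absent)
    requested: language tags the searcher asked for
    """
    # Unknown language: always a possible result, per the rule above.
    if not declared:
        return True

    # Assumption: compare on the primary subtag only ("en-US" -> "en").
    def primary(tag):
        return tag.split("-")[0].lower()

    wanted = {primary(tag) for tag in requested}
    # Multi-language documents match if any declared language is wanted.
    return any(primary(tag) in wanted for tag in declared)


print(matches_language_filter([], ["en"]))
print(matches_language_filter(["fr", "en-US"], ["en"]))
print(matches_language_filter(["de"], ["en"]))
```

The imperfection mentioned in the quoted text shows up directly: every undeclared document passes every filter, so result quality depends on how many authors actually set the parameter.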
On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:
> I agree with this in principle (i.e., as a guidepost for good user
> experience in a search engine), but there can be complicating factors
> in practice. In particular, some common and generally effective text
> indexing processes involve things like porter stemming words (so
> "stemmed" and "stemming" would both get indexed as something like
> "stem") and removal of "stop words" (and, in, the...). As you might
> imagine, both of these operations are specific to language.
>
> So, simply in creating an index of Geminispace, there might already be
> an assumed "default" language. In the case of GUS, this is English. I
> don't stop GUS from indexing any non-English content currently, but
> the quality of indexing is lower for other languages. Operations like
> the above (porter stemming and removing stop words) will simply be
> no-ops.
>
> And then the other side of this experience is that when a user types
> in a search query, that also goes through the same process - the query
> is porter stemmed, stripped of its stop words, then shuttled off to
> the TF-IDF index to find and score the actual matches.

Thanks for shedding some light on the processing that happens behind the scenes in GUS! Language declaration is even more important to search engines than I had realised. Once we've got this specced we will have to really encourage the authors of servers to make it possible for users to control this parameter, and to encourage the folks at non-English servers to use it!

> What does directionality mean in this context?

Left-to-right vs right-to-left vs top-to-bottom, etc. I don't think this will actually be relevant for search at all?

> In terms of how the above all sounds:
> - I support the addition of the document-level lang parameter, which
> can accept multiple values, to the spec.
> - I do NOT support the addition of a default document-level lang value
> to the spec.
> - I do NOT support the addition of a line-level lang parameter.

Thanks for the nice, clear summary!

Cheers,
Solderpunk

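
For readers unfamiliar with the indexing steps quoted above, the following is a toy sketch of language-dependent normalisation. It is not GUS's code and not the real Porter algorithm: toy_stem only strips "ing"/"ed" and undoes a doubled consonant, the stop-word list is a handful of English words, and all names are hypothetical. The point it illustrates is that both steps only make sense once the language is known, and that the same function has to be applied to documents and to queries.

```python
ENGLISH_STOP_WORDS = {"and", "in", "the", "a", "of"}


def toy_stem(word):
    """Crude stand-in for a real stemmer; English only."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "stemm" -> "stem".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def normalise(text, lang=None):
    """Turn document or query text into index terms."""
    tokens = text.lower().split()
    if lang != "en":
        # No tables for this language: only lowercasing and splitting happen.
        return tokens
    return [toy_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


print(normalise("the stemmed and stemming", "en"))
print(normalise("der Hund und die Katze", "de"))
```

With a declared language an indexer could pick the right tables per document instead of assuming English for everything, which is where the lang parameter would help a search engine beyond result filtering.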
On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:
> What does directionality mean in this context?

Ah, I understand directionality now from reading Nicole's response! Left-to-right vs right-to-left languages.