
Documents with mixed languages

Michael Lazar lazar.michael22 at gmail.com

Sun Dec 12 14:46:15 GMT 2021

- - - - - - - - - - - - - - - - - - - 

On Sun, Dec 12, 2021 at 5:48 AM Stephane Bortzmeyer <stephane at sources.org> wrote:

On Sat, Dec 11, 2021 at 01:06:25PM -0500,
Michael Lazar <lazar.michael22 at gmail.com> wrote
a message of 75 lines which said:
Your best bet, if you're serious about this, is to go ahead and
implement your proposal in your client. If others find it useful
they will start using it too.

I'm not sure it is something to recommend, since it can lead to "de
facto" standards and to "best viewed with client XYZ 7", which were
one of the reasons we ran away from the Web.

Speak for yourself. De-facto standards are social proof that a subset of the community actually wants and will use a feature. Which is much more convincing to me than a loud minority arguing for (or against) something based on principle alone.

This doesn't allow for mixed languages inside of a single
line/paragraph though.
Unicode has a solution, but its use is discouraged
<http://unicode.org/faq/languagetagging.html>. "Most other users who
need to tag text with the language identity should be using standard
markup mechanisms, such as those provided by HTML, XML, or other rich
text mechanisms."
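
For context, the Unicode mechanism mentioned here is the block of invisible "tag characters": U+E0001 LANGUAGE TAG, followed by tag characters in U+E0020..U+E007E that mirror printable ASCII, with U+E007F CANCEL TAG ending a tagged run. A minimal Python sketch of what tagging a run inside a single line would look like follows. The function names are hypothetical, and this is an illustration of the encoding only, not something any Gemini client is known to implement.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG (deprecated by Unicode)
CANCEL_TAG = 0xE007F    # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000    # tag characters mirror ASCII at this offset


def encode_language_tag(lang):
    """Return the invisible tag sequence for an ASCII code such as 'fr'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def strip_tags(text):
    """Drop every tag character, leaving only the visible plain text."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


def split_runs(text):
    """Split text into (language or None, visible run) pairs."""
    runs, buf, lang, i = [], [], None, 0
    while i < len(text):
        cp = ord(text[i])
        if cp in (LANGUAGE_TAG, CANCEL_TAG):
            if buf:  # close the run collected so far
                runs.append((lang, "".join(buf)))
                buf = []
            i += 1
            lang = None
            if cp == LANGUAGE_TAG:
                # Read the tag characters that spell the language code.
                tag = []
                while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
                    tag.append(chr(ord(text[i]) - TAG_OFFSET))
                    i += 1
                lang = "".join(tag) or None
            continue
        buf.append(text[i])
        i += 1
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


line = "Hello " + encode_language_tag("fr") + "bonjour" + chr(CANCEL_TAG) + " world"
print(split_runs(line))
print(strip_tags(line))
```

A client that ignores the tags can call something like `strip_tags` and still show readable text, which is the property that makes the tags "plain text", and also why they are easy to lose in transit.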

This is super interesting! I wonder what "deprecated" means for Unicode; surely they wouldn't release a backwards-incompatible version.

Their implementation guidelines systematically refute the argument for language tags.

Requirements for Language Tagging

The requirement for language information embedded in plain text data is often overstated. Many commonplace operations such as collation seldom require this extra information. In collation, for example, foreign language text is generally collated as if it were not in a foreign language. (See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for more information.) For example, an index in an English book would not sort the Slovak word “chlieb” after “czar,” where it would be collated in Slovak, nor would an English atlas put the Swedish city of Örebro after Zanzibar, where it would appear in Swedish.
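
The collation point can be made concrete with a toy Python sketch. It is not the Unicode Collation Algorithm: `english_key` and `slovak_key` are hypothetical helpers, the English key simply strips diacritics, and the Slovak key models only the rule that the digraph "ch" sorts as one letter after "h".

```python
import unicodedata


def english_key(word):
    """Sort key for an English index: ignore diacritics and case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def slovak_key(word):
    """Toy Slovak key: the digraph 'ch' is one letter placed after 'h'.

    Assumption: other Slovak rules (for example the separate placement
    of letters with a caron) are deliberately left out.
    """
    # U+FFFF compares greater than any real letter, so "h" + U+FFFF
    # lands after every word starting with plain "h" and before "i".
    return english_key(word).replace("ch", "h\uffff")


words = ["czar", "chlieb"]
print(sorted(words, key=english_key))   # English index order
print(sorted(words, key=slovak_key))    # Slovak dictionary order

places = ["Zanzibar", "Örebro"]
print(sorted(places, key=english_key))  # English atlas order
```

In an English index the whole list is sorted with one key no matter where each word comes from, so no per-word language information is needed.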

Text to speech is also an area where the case for embedded language information is overstated. Although language information may be useful in performing text-to-speech operations, modern software for doing acceptable text-to-speech must be so sophisticated in performing grammatical analysis of text that the extra work in determining the language is not significant in practice.

Language information can be useful in certain operations, such as spell-checking or hyphenating a mixed-language document. It is also useful in choosing the default font for a run of unstyled text; for example, the ellipsis character may have a very different appearance in Japanese fonts than in European fonts. Modern font and layout technologies produce different results based on language information. For example, the angle of the acute accent may be different for French and Polish.

Language Tags and Han Unification

A common misunderstanding about Unicode Han unification is the mistaken belief that Han characters cannot be rendered properly without language information. This idea might lead an implementer to conclude that language information must always be added to plain text using the tags. However, this implication is incorrect. The goal and methods of Han unification were to ensure that the text remained legible. Although font, size, width, and other format specifications need to be added to produce precisely the same appearance on the source and target machines, plain text remains legible in the absence of these specifications. There should never be any confusion in Unicode, because the distinctions between the unified characters are all within the range of stylistic variations that exist in each country. No unification in Unicode should make it impossible for a reader to identify a character if it appears in a different font. Where precise font information is important, it is best conveyed in a rich text format.