💾 Archived View for gemi.dev › gemini-mailing-list › 000443.gmi captured on 2023-11-04 at 12:49:22. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
This post is to suggest that servers currently using file extensions to determine MIME-types switch to libmagic. This C library analyzes the content of a file (by name, by file descriptor, or by looking at a buffer containing the content) and can provide a MIME-type and an encoding. This library is behind the `file` command on Linux, FreeBSD, and NetBSD (but not OpenBSD), and there are interfaces for at least Python, Rust, and Go, plus a version for Windows. If you are testing (or serving!) on a Mac, use Homebrew or Guix. Googling for "libmagic" and some keyword will probably find more.

Obviously, using libmagic is slower than just comparing a file extension to a list of known extensions. But Gemini servers are not, in general, high-volume, and it has the advantage of being maintained by an outside group that is quite good about accepting information about new file formats. This means that Gemini servers can serve content in most formats without a problem.

Unfortunately, at the moment `file --mime` will report either "text/plain; charset=utf-8" or "text/plain; charset=us-ascii" instead of text/gemini. So an interesting question is: how can a text/gemini file best be identified by its content? It doesn't have to be infallible, because it can be backed up by checking the extension.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
The native charset of SMS messages supports English, French, mainland Scandinavian languages, German, Italian, Spanish with no accents, and GREEK SHOUTING. Everything else has to be Unicode, which means you get only 70 16-bit characters in a text instead of 160 7-bit characters.

It was thus said that the Great John Cowan once stated:
> This post is to suggest that servers currently using file extensions to determine MIME-types switch to libmagic.

The first Gemini server, GLV-1.12556, uses libmagic, and always has. I consider that an implementation detail.

> Unfortunately, at the moment `file --mime` will report either "text/plain; charset=utf-8" or "text/plain; charset=us-ascii" instead of text/gemini. So an interesting question is: how can a text/gemini file best be identified by its content? It doesn't have to be infallible, because it can be backed up by checking the extension.

GLV-1.12556 allows one to specify MIME type by extension; if no such information is available, it will then fall back to libmagic for the MIME type. You can also specify such an extension mapping for the entire server, per host, per directory or per file [1].

-spc (You can check the sample-conf.lua file for more information [2])

[1] It also cascades---the various levels are merged when the configuration file is read to avoid processing overhead during normal operations.
[2] https://github.com/spc476/GLV-1.12556

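As an illustration of the cascading idea in footnote [1], here is a minimal Python sketch. It is not GLV-1.12556's code (that server is configured through Lua); the function name `merge_levels` and the sample mappings are hypothetical. It assumes each level is a plain extension-to-MIME-type table and that more specific levels override less specific ones.

```python
from typing import Dict

MimeMap = Dict[str, str]

def merge_levels(*levels: MimeMap) -> MimeMap:
    """Merge extension-to-MIME maps, least specific first; later levels win."""
    merged: MimeMap = {}
    for level in levels:
        merged.update(level)
    return merged

# Hypothetical settings, not taken from sample-conf.lua.
server = {".gmi": "text/gemini", ".txt": "text/plain"}
host = {".txt": "text/plain; charset=utf-8"}
directory = {".log": "text/plain"}

# Merged once at startup, so each request is a single dictionary lookup.
effective = merge_levels(server, host, directory)
```

Merging up front is what keeps the per-request cost low: the levels are never consulted again while serving.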
On Fri, Nov 6, 2020 at 8:08 PM Sean Conner <sean at conman.org> wrote:
> The first Gemini server, GLV-1.12556, uses libmagic, and always has. I consider that an implementation detail.

Absolutely. But lots of people don't know about libmagic, and it is not mentioned in the Best Practices document, whereas extension processing *is* mentioned. This is an environment where we will probably continue to have lots of servers.

Any ideas for identifying text/gemini by regex?

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
We pledge allegiance to the penguin and to the intellectual property regime for which he stands, one world under Linux, with free music and open source software for all. --Julian Dibbell on Brazil, edited

It was thus said that the Great John Cowan once stated:
> On Fri, Nov 6, 2020 at 8:08 PM Sean Conner <sean at conman.org> wrote:
> > The first Gemini server, GLV-1.12556, uses libmagic, and always has. I consider that an implementation detail.
>
> Absolutely. But lots of people don't know about libmagic, and it is not mentioned in the Best Practices document, whereas extension processing *is* mentioned. This is an environment where we will probably continue to have lots of servers.
> Any ideas for identifying text/gemini by regex?

/^\=\>/ anywhere in the file?

-spc

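The `/^\=\>/` suggestion above could be sketched in Python roughly as follows. The helper name `looks_like_gemtext` is hypothetical, and this is only a heuristic: it reports whether any line begins with `=>`, so a document with no link lines is not recognized, and a `=>` line inside a preformatted block still counts.

```python
import re

# A gemtext link line begins with "=>" at the start of a line.
# re.MULTILINE makes "^" match at every line start, not just the first.
LINK_LINE = re.compile(r"^=>", re.MULTILINE)

def looks_like_gemtext(text: str) -> bool:
    """Heuristic only: True if any line in the text begins with '=>'."""
    return LINK_LINE.search(text) is not None
```

Because a miss is always possible, a check like this would sit alongside the extension check rather than replace it.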
On Fri, 6 Nov 2020 20:21:50 -0500 John Cowan <cowan at ccil.org> wrote:
> Any ideas for identifying text/gemini by regex?

There are none that aren't likely to produce either false positives or false negatives. Consider this valid, plausibly realistic text/gemini document (indented):

    # Hello

    This is the first paragraph.

    This is the second.

    * List item 1
    * List item 2

    ## Subsection 1

    This is a subsection.

    ```
    This is some pre-formatted text
    ```

This is also a valid text/markdown and text/plain document. I'd say that the most easily identifiable trait of text/gemini is the link arrows, but plenty of documents contain no links, and (less likely) examples that would generate false positives/negatives can still be created.

IMO, solutions like libmagic should be used as a last resort. Let the server admin associate file types with extensions (for example via a system provided MIME database, or in a cascading fashion starting with manual associations in the server configuration file, and using the MIME database if that fails). If no such association exists, by all means utilize libmagic and hope for the best.

-- Philip

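The lookup order described above (manual associations first, then the MIME database, then content sniffing as a last resort) might look like this in Python. This is a sketch under assumptions: `guess_mime`, its parameters, and the `sniff` callback are hypothetical names, the standard `mimetypes` module stands in for the system MIME database, and `sniff` is where a libmagic binding would be plugged in.

```python
import mimetypes
import os
from typing import Callable, Mapping, Optional

def guess_mime(
    path: str,
    overrides: Mapping[str, str],
    sniff: Optional[Callable[[str], Optional[str]]] = None,
    default: str = "application/octet-stream",
) -> str:
    """Resolve a MIME type: manual map, then MIME database, then sniffing."""
    # 1. Manual associations from the server configuration (hypothetical map).
    ext = os.path.splitext(path)[1].lower()
    if ext in overrides:
        return overrides[ext]
    # 2. The MIME database, here Python's stdlib mimetypes tables.
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed:
        return guessed
    # 3. Last resort: content sniffing, e.g. a wrapper around libmagic.
    if sniff is not None:
        sniffed = sniff(path)
        if sniffed:
            return sniffed
    return default
```

For example, `guess_mime("index.gmi", {".gmi": "text/gemini"})` is answered by the manual map and never reaches the sniffer.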
---