💾 Archived View for gemi.dev › gemini-mailing-list › 000443.gmi captured on 2023-11-04 at 12:49:22. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

libmagic

John Cowan <cowan (a) ccil.org>

This post is to suggest that servers currently using file extensions to
determine MIME-types switch to libmagic.  This C library analyzes the
content of a file (by name, by file descriptor, or by looking at a buffer
containing the content) and can provide a MIME-type and an encoding.  This
library is behind the `file` command on Linux, FreeBSD, and NetBSD (but not
OpenBSD), and there are interfaces for at least Python, Rust, and Go, plus
a version for Windows.  If you are testing (or serving!) on a Mac, use
Homebrew or Guix.  Googling for "libmagic" and some keyword will probably
find more.

Obviously, using libmagic is slower than just comparing a file extension to
a list of known extensions.  But Gemini servers are not, in general,
high-volume, and it has the advantage of being maintained by an outside
group that is quite good about accepting information about new file
formats.  This means that Gemini servers can serve content in most formats
without a problem.

Unfortunately, at the moment `file --mime`  will report either "text/plain;
charset=utf-8" or "text/plain; charset=us-ascii" instead of text/gemini.
So an interesting question is: how can a text/gemini file best be
identified by its content? It doesn't have to be infallible, because it can
be backed up by checking the extension.
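
The extension backup described above can be sketched in a few lines. This is a minimal illustration, assuming the server already holds the string reported by `file --mime`; the function names and the list of Gemini extensions are hypothetical and not taken from any existing server.

```python
import os

# Hypothetical extension list; adjust to whatever the server treats as Gemini text.
GEMINI_EXTENSIONS = {".gmi", ".gemini"}

def parse_mime(reported):
    """Split 'text/plain; charset=utf-8' into ('text/plain', {'charset': 'utf-8'})."""
    parts = [p.strip() for p in reported.split(";")]
    mime_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return mime_type, params

def refine_mime(path, reported):
    """Promote text/plain to text/gemini when the file extension says so."""
    mime_type, params = parse_mime(reported)
    ext = os.path.splitext(path)[1].lower()
    if mime_type == "text/plain" and ext in GEMINI_EXTENSIONS:
        mime_type = "text/gemini"
    if params:
        extras = "; ".join(f"{k}={v}" for k, v in params.items())
        return f"{mime_type}; {extras}"
    return mime_type
```

Only a text/plain result is overridden, so a mislabelled binary file with a `.gmi` extension keeps the type that content analysis reported.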


John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
The native charset of SMS messages supports English, French, mainland
Scandinavian languages, German, Italian, Spanish with no accents, and
GREEK SHOUTING.  Everything else has to be Unicode, which means you get
only 70 16-bit characters in a text instead of 160 7-bit characters.

Sean Conner <sean (a) conman.org>

It was thus said that the Great John Cowan once stated:
> This post is to suggest that servers currently using file extensions to
> determine MIME-types switch to libmagic.  

  The first Gemini server, GLV-1.12556, uses libmagic, and always has.  I
consider that an implementation detail.

> Unfortunately, at the moment `file --mime`  will report either "text/plain;
> charset=utf-8" or "text/plain; charset=us-ascii" instead of text/gemini.
> So an interesting question is: how can a text/gemini file best be
> identified by its content? It doesn't have to be infallible, because it can
> be backed up by checking the extension.

  GLV-1.12556 allows one to specify MIME type by extension; if no such
information is available, it will then fall back to libmagic for the MIME
type.  You can also specify such an extension mapping for the entire server,
per host, per directory or per file [1].

  -spc (You can check the sample-conf.lua file for more information [2])
  
[1]	It also cascades---the various levels are merged when the configuration
	file is loaded, to avoid processing overhead during normal operations.

[2]	https://github.com/spc476/GLV-1.12556
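
The idea of merging the levels once, at configuration time, might look roughly like the sketch below. It is not GLV-1.12556's code; the names, the sample mappings, and the rule that a more specific level overrides a broader one are all assumptions made for illustration.

```python
def merge_levels(*levels):
    """Merge extension-to-MIME maps; later (more specific) levels win."""
    merged = {}
    for level in levels:
        merged.update(level)
    return merged

# Hypothetical configuration, listed from least to most specific.
server = {".txt": "text/plain", ".gmi": "text/gemini"}
host = {".md": "text/markdown"}
directory = {".txt": "text/gemini"}  # this directory serves .txt as Gemini text

# Computed once at load time, so a request needs only one dictionary lookup.
directory_table = merge_levels(server, host, directory)
```

Because the merged table is built before any request arrives, the per-request cost is the same no matter how many levels the configuration defines.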

John Cowan <cowan (a) ccil.org>

On Fri, Nov 6, 2020 at 8:08 PM Sean Conner <sean at conman.org> wrote:


>   The first Gemini server, GLV-1.12556, uses libmagic, and always has.  I
> consider that an implementation detail.
>

Absolutely.  But lots of people don't know about libmagic, and it is not
mentioned in the Best Practices document, whereas extension processing *is*
mentioned.  This is an environment where we will probably continue to have
lots of servers.
Any ideas for identifying text/gemini by regex?



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
We pledge allegiance to the penguin and to the intellectual property
regime for which he stands, one world under Linux, with free music
and open source software for all.  --Julian Dibbell on Brazil, edited

Sean Conner <sean (a) conman.org>

It was thus said that the Great John Cowan once stated:
> On Fri, Nov 6, 2020 at 8:08 PM Sean Conner <sean at conman.org> wrote:
> 
> 
> >   The first Gemini server, GLV-1.12556, uses libmagic, and always has.  I
> > consider that an implementation detail.
> >
> 
> Absolutely.  But lots of people don't know about libmagic, and it is not
> mentioned in the Best Practices document, whereas extension processing *is*
> mentioned.  This is an environment where we will probably continue to have
> lots of servers.
> Any ideas for identifying text/gemini by regex?

  /^\=\>/ anywhere in the file?
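
As a rough sketch of that check, assuming the file has already been decoded to text (the function name is illustrative only):

```python
import re

# Same idea as /^\=\>/ with multi-line matching: "=>" at the start of any line.
LINK_LINE = re.compile(r"^=>", re.MULTILINE)

def looks_like_gemini(text):
    """Heuristic only: True if any line starts with a link arrow."""
    return LINK_LINE.search(text) is not None
```

This is a heuristic, not a proof: a text/gemini document without links is missed, and any plain-text file with a line starting with "=>" matches.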

  -spc

Philip Linde <linde.philip (a) gmail.com>

On Fri, 6 Nov 2020 20:21:50 -0500
John Cowan <cowan at ccil.org> wrote:

> Any ideas for identifying text/gemini by regex?

There are none that aren't likely to produce either false positives or
false negatives. Consider this valid, plausibly realistic text/gemini
document (indented):

  # Hello
  
  This is the first paragraph.
  
  This is the second.

  * List item 1
  * List item 2
  
  ## Subsection 1
  
  This is a subsection.

  ```
  This is some pre-formatted text
  ```

This is also a valid text/markdown and text/plain document. I'd say
that the most easily identifiable trait of text/gemini is the link
arrows, but plenty of documents contain no links, and (less likely)
examples that would generate false positives/negatives can still be
created.

IMO, solutions like libmagic should be used as a last resort. Let the
server admin associate file types with extensions (for example via a
system-provided MIME database, or in a cascading fashion starting with
manual associations in the server configuration file, and using the MIME
database if that fails). If no such association exists, by all means
utilize libmagic and hope for the best.
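
That lookup order could be sketched as follows. Python's standard `mimetypes` module stands in for the system MIME database here, and `sniff` is a hypothetical callback for whatever content detection (libmagic or otherwise) the server has available; neither choice comes from an existing server.

```python
import mimetypes
import os

def guess_mime(path, manual, sniff=None):
    """Resolve a MIME type: manual map, then MIME database, then content sniffing."""
    ext = os.path.splitext(path)[1].lower()
    # 1. Manual associations from the server configuration.
    if ext in manual:
        return manual[ext]
    # 2. The MIME database (here: Python's mimetypes module).
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is not None:
        return guessed
    # 3. Last resort: content-based detection, if a detector was supplied.
    if sniff is not None:
        return sniff(path)
    return "application/octet-stream"
```

A server would call it as `guess_mime("index.gmi", {".gmi": "text/gemini"})`, passing a detector as `sniff` only when one is installed.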

-- 
Philip
