Good practices regarding MIME type

Solène Rapenne <solene (a) perso.pw>

Hi,

I wrote a Gemini server in C and I currently use a hardcoded list of
file extension <-> MIME type associations.
This isn't great because it relies on the file extension, which can be
wrong, but a file without an extension would use a default.

I chose to set a default text/gemini in case the extension is unknown or 
if the file has no extension.

What are the good practices to determine a file MIME type?

regards
Solène

John Cowan <cowan (a) ccil.org>

You can use the file(1) command or its library libmagic.  These are
excellent for binary files and support some text file formats, and are
extensible.  There are libmagic bindings for at least Perl, Python, Go,
Rust, Lua, Common Lisp, and Chicken Scheme.

On Thu, Dec 10, 2020 at 4:12 PM Solène Rapenne <solene at perso.pw> wrote:

> Hi,
>
> I wrote a Gemini server in C and I currently use a hardcoded list of
> file extension <-> MIME type associations.
> This isn't great because it relies on the file extension, which can be
> wrong, but a file without an extension would use a default.
>
> I chose to set a default text/gemini in case the extension is unknown or
> if the file has no extension.
>
> What are the good practices to determine a file MIME type?
>
> regards
> Solène
>

Omar Polo <op (a) omarpolo.com>


Solène Rapenne <solene at perso.pw> writes:

> Hi,
>
> I wrote a Gemini server in C and I currently use a hardcoded list of
> file extension <-> MIME type associations.
> This isn't great because it relies on the file extension, which can be
> wrong, but a file without an extension would use a default.
>
> I chose to set a default text/gemini in case the extension is unknown
> or if the file has no extension.
>
> What are the good practices to determine a file MIME type?
>
> regards
> Solène

I'm using the same approach in my server, but there are two alternatives
I know:

 - using /usr/share/misc/mime.types (still a list, but probably more
   complete than a manual one).  Don't know if it's widespread, but
   it's present in base on OpenBSD :)
 - using libmagic: it's a library to detect the MIME type by reading the
   file.  It powers the file(1) command on some unices.  The drawback is
   that it needs to open and read the file, whereas guessing from the
   extension doesn't.
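
A minimal sketch of the first alternative, reading the list at run time
instead of hardcoding it. It assumes the usual mime.types layout (a media
type followed by whitespace-separated extensions, with "#" starting a
comment) and lines shorter than the read buffer. The function names are
hypothetical and not taken from any server in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive comparison of an n-byte token against the string ext. */
static int token_equals(const char *tok, size_t n, const char *ext)
{
    size_t i;

    if (strlen(ext) != n)
        return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)tok[i]) != tolower((unsigned char)ext[i]))
            return 0;
    return 1;
}

/* Check one line such as "text/html html htm" for ext (given without the
 * dot). On a match, copy the media type into out and return 1. */
int match_mime_line(const char *line, const char *ext, char *out, size_t outlen)
{
    const char *p, *type;
    size_t type_len, end;

    assert(line != NULL && ext != NULL && out != NULL && outlen > 0);

    end = strcspn(line, "#\r\n");        /* ignore comments and line endings */
    p = line + strspn(line, " \t");
    if ((size_t)(p - line) >= end)
        return 0;                        /* blank or comment-only line */

    type = p;
    type_len = strcspn(p, " \t#\r\n");
    p += type_len;

    while ((size_t)(p - line) < end) {
        size_t n;

        p += strspn(p, " \t");
        n = strcspn(p, " \t#\r\n");
        if (n == 0)
            break;
        if (token_equals(p, n, ext)) {
            snprintf(out, outlen, "%.*s", (int)type_len, type);
            return 1;
        }
        p += n;
    }
    return 0;
}

/* Scan a mime.types-style file line by line. Returns 1 on a match and 0 if
 * the file cannot be opened or no line lists the extension. */
int lookup_mime_types(const char *path, const char *ext, char *out, size_t outlen)
{
    char line[512];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return 0;
    while (!found && fgets(line, sizeof(line), fp) != NULL)
        found = match_mime_line(line, ext, out, outlen);
    fclose(fp);
    return found;
}
```

A server would call lookup_mime_types("/usr/share/misc/mime.types", "gmi",
buf, sizeof(buf)) and fall back to its own default when it returns 0.
Scanning the file on every request is wasteful, so loading it once at
startup is probably the better design.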

William Orr <will (a) worrbase.com>

Hey,

AFAIK, OpenBSD doesn't ship a libmagic implementation by default, but it
does ship a version of file(1) as well as a magic(5) db that you can look
at. If you look at the source of the file command, you might be able to
work out how to make use of the magic(5) db or just lift the code from there.

Alternatively, libmagic is in ports.

Hope that helps!

Solène Rapenne <solene (a) perso.pw>

On 2020-12-10 22:26, Omar Polo wrote:
> Solène Rapenne <solene at perso.pw> writes:
> 
>> Hi,
>> 
>> I wrote a Gemini server in C and I currently use a hardcoded list of
>> file extension <-> MIME type associations.
>> This isn't great because it relies on the file extension, which can be
>> wrong, but a file without an extension would use a default.
>> 
>> I chose to set a default text/gemini in case the extension is unknown
>> or if the file has no extension.
>> 
>> What are the good practices to determine a file MIME type?
>> 
>> regards
>> Solène
> 
> I'm using the same approach in my server, but there are two alternatives
> I know:
>
>  - using /usr/share/misc/mime.types (still a list, but probably more
>    complete than a manual one).  Don't know if it's widespread, but
>    it's present in base on OpenBSD :)
>  - using libmagic: it's a library to detect the MIME type by reading the
>    file.  It powers the file(1) command on some unices.  The drawback is
>    that it needs to open and read the file, whereas guessing from the
>    extension doesn't.

I already did use that exact mime.types file, but I hardcoded it. I will
take a look at file(1) code. I target OpenBSD first but I'll see if
it can be ported easily.

John Cowan <cowan (a) ccil.org>

libmagic uses the local mime-types file(s) but is conditionalized to know
where they are on different operating systems, so it's better to use it,
even if it is not installed by default.

On Thu, Dec 10, 2020 at 4:26 PM Omar Polo <op at omarpolo.com> wrote:

>
> Solène Rapenne <solene at perso.pw> writes:
>
> > Hi,
> >
> > I wrote a Gemini server in C and I currently use a hardcoded list of
> > file extension <-> MIME type associations.
> > This isn't great because it relies on the file extension, which can be
> > wrong, but a file without an extension would use a default.
> >
> > I chose to set a default text/gemini in case the extension is unknown
> > or if the file has no extension.
> >
> > What are the good practices to determine a file MIME type?
> >
> > regards
> > Solène
>
> I'm using the same approach in my server, but there are two alternatives
> I know:
>
>  - using /usr/share/misc/mime.types (still a list, but probably more
>    complete than a manual one).  Don't know if it's widespread, but
>    it's present in base on OpenBSD :)
>  - using libmagic: it's a library to detect the MIME type by reading the
>    file.  It powers the file(1) command on some unices.  The drawback is
>    that it needs to open and read the file, whereas guessing from the
>    extension doesn't.
>

colecmac@protonmail.com <colecmac (a) protonmail.com>

> I chose to set a default text/gemini in case the extension is unknown or
> if the file has no extension.

This is not a good idea for any unrecognized file, extension or not. If you
know the file is UTF-8 text, serve it as "text/plain", otherwise you should
serve it as "application/octet-stream", indicating a generic binary file.
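
A rough sketch of that fallback in C, assuming the server has already read
the file (or a leading sample of it) into memory. The function names are
hypothetical. The UTF-8 check is deliberately simplified: it validates lead
and continuation bytes but does not reject every overlong or surrogate
encoding, and it treats NUL and most control characters as a sign of binary
data.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf looks like UTF-8 text, 0 otherwise. A multi-byte sequence
 * cut off at the end of the buffer counts as invalid, so a caller that only
 * samples the start of a file may want to trim the sample first. */
int is_utf8_text(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra, k;    /* extra = continuation bytes expected */

        if (c < 0x80) {
            /* ASCII: allow tab, LF and CR, reject other control bytes. */
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                return 0;
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (i + extra >= len)
            return 0;       /* truncated sequence */
        for (k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}

/* Media type for a file whose extension was not recognized. */
const char *fallback_media_type(const unsigned char *buf, size_t len)
{
    return is_utf8_text(buf, len) ? "text/plain" : "application/octet-stream";
}
```

Note that an empty file passes the check and is served as text/plain here;
whether that is the right choice is a judgment call.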

Jetforce used to default to text/plain for all files[1], and it was a problem
because clients will try to display binary data as text, resulting in garbled
data. For example, try running `cat /dev/urandom` and see how that looks.

1: https://github.com/michael-lazar/jetforce/issues/38#issuecomment-659688602


Cheers,
makeworld

P.S. The more accurate term is media type, not MIME type. See
https://www.iana.org/assignments/media-types/media-types.xhtml, or
just Wikipedia :)

Sean Conner <sean (a) conman.org>

It was thus said that the Great Solène Rapenne once stated:
> Hi,
> 
> I wrote a Gemini server in C and I currently use a hardcoded list of
> file extension <-> MIME type associations.
> This isn't great because it relies on the file extension, which can be
> wrong, but a file without an extension would use a default.
> 
> I chose to set a default text/gemini in case the extension is unknown or 
> if the file has no extension.
> 
> What are the good practices to determine a file MIME type?

  For my server [1] you can configure a mapping of extensions to MIME type. 
If a file's extension isn't found in that mapping, then I use libmagic to
determine the MIME type.

  -spc

[1]	https://github.com/spc476/GLV-1.12556

Philip Linde <linde.philip (a) gmail.com>

On Thu, 10 Dec 2020 22:12:41 +0100
Solène Rapenne <solene at perso.pw> wrote:

> Hi,
> 
> I wrote a Gemini server in C and I currently use a hardcoded list of
> file extension <-> MIME type associations.
> This isn't great because it relies on the file extension, which can be
> wrong, but a file without an extension would use a default.
> 
> I chose to set a default text/gemini in case the extension is unknown or 
> if the file has no extension.
> 
> What are the good practices to determine a file MIME type?

There is no way that will work completely reliably without implementing
full parsers of the different file types. For a complete server I'd
expect to be able to determine file type for a certain served resource
myself without relying on an extension-type mapping. For example, to be
able to say that every file under /text/ is text/plain.

John Cowan suggests libmagic and file. AFAIK utilities/libraries like
this can operate using matching rules on a few bytes of a file to
determine the file type with a limited degree of accuracy.
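
To illustrate the idea (this is not libmagic's API or its rule format, just
a hand-rolled sketch with a hypothetical function name), matching on leading
bytes can look like this. The four signatures are the well-known file
headers for PNG, JPEG, GIF and PDF; a real magic database has far more rules
and can test at offsets other than zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct signature {
    const char *bytes;   /* expected leading bytes */
    size_t len;          /* number of bytes to compare */
    const char *type;    /* media type to report on a match */
};

static const struct signature signatures[] = {
    { "\x89PNG\r\n\x1a\n", 8, "image/png" },
    { "\xff\xd8\xff",      3, "image/jpeg" },
    { "GIF8",              4, "image/gif" },
    { "%PDF-",             5, "application/pdf" },
};

/* Return a media type if the start of buf matches a known signature,
 * or NULL when nothing matches (the caller then picks a default). */
const char *sniff_media_type(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < sizeof(signatures) / sizeof(signatures[0]); i++) {
        const struct signature *s = &signatures[i];

        if (len >= s->len && memcmp(buf, s->bytes, s->len) == 0)
            return s->type;
    }
    return NULL;
}
```

The limited accuracy is visible here: any file that happens to start with
"%PDF-" is reported as a PDF, whether or not the rest of it parses.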

I suggest a procedure like this to determine the file type:

1. Check if there is a configuration rule for this particular file to
   determine its file type. If so, use that.
2. If there is not, check if the server extension-type mapping
   configuration contains the file extension. If so, use that.
3. If there is not, check the system-level MIME type database for a
   type assigned to the extension. If so, use that.
4. If there is not, you can now optionally use some heuristic approach
   to determine the file type. This can be via a library like
   libmagic, or a simpler approach as suggested by makeworld to
   determine whether you can defer to text/plain or not. If so, use
   that.
5. If not, assume application/octet-stream and use that.
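
The procedure above can be sketched in C as a chain of resolvers tried in
order, each returning a media type or NULL. Everything named here is
hypothetical: the /text/ rule and the three-entry extension table stand in
for real server configuration, and steps 3 and 4 (system database,
heuristics) are left as slots in the chain rather than implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const char *(*resolver_fn)(const char *path);

/* Step 1: per-path configuration rule, e.g. everything under /text/. */
static const char *by_path_rule(const char *path)
{
    if (strncmp(path, "/text/", 6) == 0)
        return "text/plain";
    return NULL;
}

/* Step 2: the server's own extension-to-type mapping. */
static const char *by_extension(const char *path)
{
    static const struct { const char *ext; const char *type; } map[] = {
        { ".gmi", "text/gemini" },
        { ".txt", "text/plain" },
        { ".png", "image/png" },
    };
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');
    size_t i;

    /* No dot, or the last dot belongs to a directory name. */
    if (dot == NULL || (slash != NULL && dot < slash))
        return NULL;
    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
        if (strcmp(dot, map[i].ext) == 0)
            return map[i].type;
    return NULL;
}

/* Try each resolver in order; the first non-NULL answer wins. */
const char *resolve_media_type(const char *path)
{
    static const resolver_fn chain[] = {
        by_path_rule,
        by_extension,
        /* step 3 (system MIME database) and step 4 (heuristics) go here */
    };
    size_t i;

    assert(path != NULL);

    for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
        const char *type = chain[i](path);

        if (type != NULL)
            return type;
    }
    return "application/octet-stream";   /* step 5: generic binary */
}
```

With this ordering, "/text/logo.png" resolves to text/plain because the
explicit rule wins over the extension, which matches the intent of step 1.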

You could cache the results in memory and drop the cache on, e.g., SIGHUP.

As for extension-less files, if you don't want extensions visible to
the client, you can still use extensions on the server side, which the
server optionally strips off.

Overall I think it's fair to expect some level of effort from the
server operator in making sure that the static files have sensible
extensions. Any smartness beyond extension mapping is at best a bonus
AFAIC, at worst a potentially nasty surprise.

-- 
Philip

Jason McBrayer <jmcbray (a) carcosa.net>

Sean Conner <sean at conman.org> writes:

>   For my server [1] you can configure a mapping of extensions to MIME
> type. If a file's extension isn't found in that mapping, then I use
> libmagic to determine the MIME type.

My server (Germinal) does the same thing, largely because that's what
the MIME library I'm using does by default.

-- 
Jason McBrayer      | "Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa."
                    | -- Robert W. Chambers, The King in Yellow
